Add a sorted wordcount to the bottom of all_pod_files_spelling_ok #7

Merged
merged 1 commit into from Sep 26, 2014

Conversation

kentfredric
Contributor

This makes nuking the most commonly misspelt words easier if you're
working on a large project with a lot of files, by helping you easily
see what needs to be deemed a "stopword" and what needs to be hit with
an edit pass.

This became an apparent issue when I was adding a Pod spelling check to dbix-class, which had 80 subtests' worth of spelling failures that were hard to make sense of.

After this patch, it emits the following at the end:

# All wrong words: ActorRoles=1, Authorized=1, AutoCast=1, BelongsTo=1,
#    Bloggs=1, CachedKids=1, Centos=1, Compat=1, DBA=1, DELETEs=1, DEV=1,
#    Easysoft=1, Eg=1, FC=1, FileColumn=1, FirstSkip=1, FromForm=1, HRI=1,
#    INSERTs=1, Inflators=1, LimitXY=1, LiveObjectIndex=1, MRO=1,
#    ManyToMany=1, Microsft=1, Multicreate=1, MyApp=1, NULLs=1,
#    NoObjectIndex=1, Northwind=1, Optimizer=1, Overridability=1,
#    POSTGRESQL=1, Queryable=1, RDBMSes=1, RDMS=1, REPL=1, RESULTSET=1,
#    ReLoad=1, Reblessing=1, ResulSet=1, ResultSetColumn=1, Resultset=1,
#    Retrevial=1, RowNum=1, SQLT=1, SchemaVersions=1, Serializable=1,
#    Stringfy=1, Subquery=1, Subselects=1, Suported=1, TMTOWTDI=1,
#    TxnScopeGuard=1, UPDATEs=1, VARCHAR=1, WhereJoin=1, YYYY=1, albumid=1,
#    artifact=1, authorization=1, autodetected=1, autodetection=1,
#    autodetects=1, autoinc=1, autovalidation=1, blabla=1, bugreport=1,
#    caling=1, callframe=1, catalyze=1, cdbi=1, chocolateboy=1, columnname=1,
#    dbh=1, dclone=1, dcloned=1, dcloning=1, de=1, deallocating=1,
#    deferrable=1, dsn=1, explorative=1, failover=1, gR=1, generalized=1,
#    getter=1, gravatar=1, hundres=1, inflator=1, initialize=1, insertdb=1,
#    instantiation=1, introspectible=1, lifecycle=1, localize=1, logsystem=1,
#    materialized=1, mis=1, mysqld=1, nfreeze=1, nls=1, noindexed=1,
#    numification=1, onwards=1, optimized=1, optimizer=1, organize=1,
#    params=1, qw=1, randomizer=1, readbound=1, rebless=1, recognize=1,
#    reconnectable=1, reimplementation=1, ruleset=1, scalarref=1, se=1,
#    serializable=1, serialize=1, serialized=1, serializing=1, specialized=1,
#    sqlt=1, stacktrace=1, standardized=1, suboptimal=1, superset=1, sys=1,
#    sysdate=1, testdb=1, thusly=1, transactionally=1, txn=1, uncommited=1,
#    uninflated=1, uniqueidentifierstr=1, unixODBC=1, unserializable=1,
#    upto=1, utilize=1, versioning=1, webpages=1, xyz=1, AutoCommit=2,
#    BUILDARGS=2, DBs=2, DESC=2, EasySoft=2, FK=2, FilterColumn=2,
#    HashRefInflator=2, Kioku=2, LimitOffset=2, LinerNotes=2, ORMs=2,
#    PREFETCHING=2, Postgres=2, RHS=2, ResultSets=2, SELECTs=2, SkipFirst=2,
#    Subqueries=2, attr=2, autoincremented=2, autoincrementing=2,
#    balancers=2, callsites=2, colname=2, ddl=2, debugcb=2, fastpath=2,
#    freeform=2, incrementing=2, initialized=2, nextval=2, normalize=2,
#    nullable=2, overwritable=2, postgresql=2, prefetching=2, recognizes=2,
#    reconnection=2, resultsource=2, serialization=2, stringifiable=2,
#    syntaxes=2, uninserted=2, varchar=2, DBDs=3, Extensibility=3,
#    FetchFirst=3, InflateColumn=3, Informix=3, JOINs=3, PKs=3, Replicants=3,
#    RowNumberOver=3, TT=3, attrs=3, behaviors=3, natively=3, prefetches=3,
#    preversion=3, recognized=3, subquery=3, subref=3, subselects=3, unary=3,
#    uniqueidentifier=3, DSNs=4, GenericSubQ=4, RHEL=4, Replicant=4, cond=4,
#    debugfh=4, debugobj=4, overridable=4, reblessed=4, subqueries=4,
#    unversioned=4, ORM=5, ResultSources=5, callsite=5, normalizing=5,
#    classdata=6, eg=6, subselect=6, Firebird=7, prefetched=7, savepoints=8,
#    Balancer=9, DDL=9, LongReadLen=10, balancer=10, resultsets=11,
#    savepoint=13, sql=14, behavior=15, unicode=17, ODBC=23, replicants=42,
#    replicant=43, resultset=163

Which indicates resultset is either a good candidate for a stopword or a universal translation =).
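
For readers who want to see the idea outside the test suite, here is a minimal Python sketch of the summary line shown above. It is not the actual patch (which lives in Test::Spelling's Perl code); the function name `format_wordcount`, the 78-column width, and the use of `textwrap` are assumptions made for illustration only. It counts each misspelled word, sorts ascending by count and then by word, and wraps the result as a `#`-prefixed diagnostic.

```python
from collections import Counter
import textwrap


def format_wordcount(misspelled, width=78):
    """Return a wrapped '# All wrong words: word=count, ...' summary.

    `misspelled` is any iterable of words, one entry per occurrence.
    Hypothetical helper: the name and the width are illustrative assumptions.
    """
    counts = Counter(misspelled)
    # Ascending by count, then by word, so the most frequent words come last
    # and stay visible at the bottom of the terminal.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    body = ", ".join(f"{word}={count}" for word, count in ordered)
    return textwrap.fill(
        "# All wrong words: " + body,
        width=width,
        subsequent_indent="#    ",
        break_long_words=False,
        break_on_hyphens=False,
    )


if __name__ == "__main__":
    print(format_wordcount(["resultset", "sql", "resultset", "ODBC", "sql", "resultset"]))
```

Words whose counts sit far above the rest at the tail of the line are the ones worth checking first as stopword candidates.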

"All wrong words" seems awkward and forceful, though, and I couldn't think of a better wording.

@dagolden

Cool! I'd want to see this in descending order, though.

@kentfredric
Contributor Author

@xdg, ascending order was chosen mostly because terminal scrolling means the most significant words are more likely to stay on screen at the end of a test run, though I'll gladly flip the order if it's deemed necessary =)

I could probably make it nicer still by grouping by frequency instead of having a mass of =1.

@dagolden

That's a good point. Most of the time I have only a dozen or so words so it doesn't really matter. So I take it back. I'm neutral on the direction.

@kentfredric
Contributor Author

This is the revised format I'm working on; I think it's much easier on the eyes:

# All wrong words:
#      1: ACHTUNG, ASC, Analyzes, AoA, BCP, Backcompat, BegunWork, Bowden,
#         Caelum, ColumnCase, DBICTest, DBICs, DBIHacks, DDL, DESC, DTRT,
#         DWIW, De, FC, FIME, FIXUP, FKs, ForceUTF, GCed, GenSubQ, HELEMs,
#         IC, IFF, Informix, JOINs, LOBs, MOAR, MSAccess, MoreUtils, MsSQL,
#         NULLs, Normalize, ORed, ORing, OpenClient, PKs, POC, RDBMSes,
#         RSC, RaiseError, RestrictWithObject, ResultObject,
#         ResultSetColumn, RowNum, Runmode, STRICTMODE, Savepoints,
#         TEXTSIZE, TxnScopeGuard, WHEREs, WHOREIFFIC, al, aliastype,
#         aliastypes, anchore, artistid, asshats, bcp, bindattrs, bindlist,
#         bindtypes, bindvals, blockrunner, brainer, bugwards, bulkLogin,
#         captainL, centered, classdata, codepaths, codition, colinfos,
#         collapsable, collapsers, collist, confient, crosstable, curcuit,
#         dbi, defeaults, deferrable, deflator, deflators, depping,
#         dequalify, dir, disemvowel, eiehter, equalities, fBSD, fc,
#         firstcol, fixup, fsvo, ftds, fucktards, fugly, fulfill, groditi,
#         hotspot, hve, idcols, iff, implmentation, inambiguous, inflators,
#         initializations, interbase, introspectable, ivsize, jnap,
#         joinpath, joinstructure, jpath, keynames, lhs, libs, libsqlite,
#         localization, memleaks, millisecs, minimize, misbehavior, mkpath,
#         multicol, multiplicator, nonbind, normalized, nullable, objlike,
#         param, parameterize, params, parenthesized, parenthesizing,
#         pessimization, pks, pov, prefetches, preload, preloaded,
#         premulti, premultiplication, premultiplies, pruneable,
#         pseudoforking, qsub, rbuels, reblessed, recursing, reenabling,
#         refactor, refcount, refcounts, reinstall, reinvoke, reinvokes,
#         relname, relnames, relobjs, renderer, replicant, replicants,
#         reselect, reselected, rewriter, ro, rownum, rowparser, rset, rv,
#         sanitize, sanitized, scalarref, scrutinize, sensical,
#         serializing, somesuch, soooooo, stabilize, stackable,
#         stringifiable, subquerying, subselects, subst, sucky, swindon,
#         synchronize, synchronizing, tae, temporarly, tempvars, th,
#         toplevelness, unicode, unparseable, unqualify, unresolvable,
#         utilizing, vpp, vxs, waaaaay, wi, wo, wtf, yay, yyyzzz
#      2: CLOB, CursorType, DBA, FK, Firebird, HRI, JF, LobWriter, SEL, Stil,
#         adOpenStatic, analyze, autoinc, behavior, bindtype, cdbi,
#         colinfo, cref, darkpan, deps, dev, dirtyness, dsn, fixups, fk,
#         fs, hotttnesss, idents, inflator, joinmap, localizing, md, moar,
#         multicolumn, multijoins, normalization, nullability,
#         optimization, overridable, prefetched, rdbms, realiased,
#         recognize, reconnection, recursor, refactored, resultsets,
#         serialized, sqlmaker, subselect, uninserted, varchar, weaklink,
#         weakref
#      3: NoBindVars, ODBC, Postgres, RNO, SQLMaker, braindead, colname,
#         conds, dbh, de, freetds, optimizer, proto, rels, riba, rsrc,
#         scalarrefs, signaled, sqlt, sth, unversioned, wrt
#      4: DBDs, SQLT, attr, ctx, dep, premultiplied, resultsource, savepoint,
#         savepoints, sqla
#      5: DQ, dbic, optimized
#      6: codepath, compat, subq, txn
#      7: SUBOPTIMAL
#      8: LongReadLen, sql, subqueries
#      9: AutoCommit, collapser, cond
#     14: attrs
#     17: SQLA
#     22: resultset
#     26: subquery
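
The grouping shown above can be sketched in Python as follows. This is an illustration of the layout only, not the Perl code in the patch; `format_by_frequency`, its `header` parameter, and the 78-column width are hypothetical names and values chosen for the example. Words are bucketed by occurrence count, the buckets are printed in ascending order, and continuation lines are indented so they line up under the first word.

```python
from collections import Counter, defaultdict
import textwrap


def format_by_frequency(misspelled, header="# All wrong words:", width=78):
    """Group misspelled words by how often they occur, one block per count.

    Hypothetical helper: name, header default, and width are assumptions.
    """
    counts = Counter(misspelled)
    groups = defaultdict(list)
    for word, count in counts.items():
        groups[count].append(word)

    lines = [header]
    for count in sorted(groups):  # ascending, so the worst offenders end up last
        first = f"#{count:>7}: "
        # Continuation lines keep the leading '#' and align with the first word.
        rest = "#" + " " * (len(first) - 1)
        lines.append(
            textwrap.fill(
                ", ".join(sorted(groups[count])),
                width=width,
                initial_indent=first,
                subsequent_indent=rest,
                break_long_words=False,
                break_on_hyphens=False,
            )
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_by_frequency(["sql", "sql", "ODBC", "dbh"]))
```

Compared with the flat `word=count` list, this drops the repeated `=1` suffixes and makes the count the visual anchor of each block.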

@dagolden

How about "All incorrect words, by number of occurrences:"

@kentfredric
Contributor Author

Now patched and squashed as follows:

# All incorrect words, by number of occurrences:
#      1: ActorRoles, Authorized, AutoCast, BelongsTo, Bloggs, CachedKids,
#         Centos, Compat, DBA, DELETEs, DEV, Easysoft, Eg, FC, FileColumn,
#         FirstSkip, FromForm, HRI, INSERTs, Inflators, LimitXY,
#         LiveObjectIndex, MRO, ManyToMany, Microsft, Multicreate, MyApp,
#         NULLs, NoObjectIndex, Northwind, Optimizer, Overridability,
#         POSTGRESQL, Queryable, RDBMSes, RDMS, REPL, RESULTSET, ReLoad,
#         Reblessing, ResulSet, ResultSetColumn, Resultset, Retrevial,
#         RowNum, SQLT, SchemaVersions, Serializable, Stringfy, Subquery,
#         Subselects, Suported, TMTOWTDI, TxnScopeGuard, UPDATEs, VARCHAR,
#         WhereJoin, YYYY, albumid, artifact, authorization, autodetected,
#         autodetection, autodetects, autoinc, autovalidation, blabla,
#         bugreport, caling, callframe, catalyze, cdbi, chocolateboy,
#         columnname, dbh, dclone, dcloned, dcloning, de, deallocating,
#         deferrable, dsn, explorative, failover, gR, generalized, getter,
#         gravatar, hundres, inflator, initialize, insertdb, instantiation,
#         introspectible, lifecycle, localize, logsystem, materialized,
#         mis, mysqld, nfreeze, nls, noindexed, numification, onwards,
#         optimized, optimizer, organize, params, qw, randomizer,
#         readbound, rebless, recognize, reconnectable, reimplementation,
#         ruleset, scalarref, se, serializable, serialize, serialized,
#         serializing, specialized, sqlt, stacktrace, standardized,
#         suboptimal, superset, sys, sysdate, testdb, thusly,
#         transactionally, txn, uncommited, uninflated,
#         uniqueidentifierstr, unixODBC, unserializable, upto, utilize,
#         versioning, webpages, xyz
#      2: AutoCommit, BUILDARGS, DBs, DESC, EasySoft, FK, FilterColumn,
#         HashRefInflator, Kioku, LimitOffset, LinerNotes, ORMs,
#         PREFETCHING, Postgres, RHS, ResultSets, SELECTs, SkipFirst,
#         Subqueries, attr, autoincremented, autoincrementing, balancers,
#         callsites, colname, ddl, debugcb, fastpath, freeform,
#         incrementing, initialized, nextval, normalize, nullable,
#         overwritable, postgresql, prefetching, recognizes, reconnection,
#         resultsource, serialization, stringifiable, syntaxes, uninserted,
#         varchar
#      3: DBDs, Extensibility, FetchFirst, InflateColumn, Informix, JOINs,
#         PKs, Replicants, RowNumberOver, TT, attrs, behaviors, natively,
#         prefetches, preversion, recognized, subquery, subref, subselects,
#         unary, uniqueidentifier
#      4: DSNs, GenericSubQ, RHEL, Replicant, cond, debugfh, debugobj,
#         overridable, reblessed, subqueries, unversioned
#      5: ORM, ResultSources, callsite, normalizing
#      6: classdata, eg, subselect
#      7: Firebird, prefetched
#      8: savepoints
#      9: Balancer, DDL
#     10: LongReadLen, balancer
#     11: resultsets
#     13: savepoint
#     14: sql
#     15: behavior
#     17: unicode
#     23: ODBC
#     42: replicants
#     43: replicant
#    163: resultset

(NB: the lists of words differ because I'm prototyping my changes on a non-Test-Spelling codebase, so there's no need for concern. The list with "resultset" as the most common word comes from Test-Spelling's Pod checks, and the one with 'subquery' is extracted from the comments: https://github.com/kentfredric/dbix-class/blob/topic/spelling.t/xt/commentspelling.t#L57 )

sartak added a commit that referenced this pull request Sep 26, 2014
Add a sorted wordcount to the bottom of all_pod_files_spelling_ok
@sartak sartak merged commit e9a2a7b into genio:master Sep 26, 2014
@sartak
Collaborator

sartak commented Sep 26, 2014

Awesome. Thank you @kentfredric! I will ship this in a few days when I return from my trip.

@sartak
Collaborator

sartak commented Oct 7, 2014

POSTing upload for Test-Spelling-0.20.tar.gz to https://pause.perl.org/pause/authenquery
