Add a sorted wordcount to the bottom of all_pod_files_spelling_ok #7

kentfredric · 2014-09-26T04:35:11Z

This makes nuking the most commonly misspelt words easier if you're
working on a large project with a lot of files, by helping you easily
see what needs to be deemed a "stopword" and what needs to be hit with
an edit pass.

This became an apparent issue when I was adding a pod spelling check to dbix-class, which had 80 subtests' worth of spelling failures that were hard to make sense of.

After this patch, it emits the following at the end:

# All wrong words: ActorRoles=1, Authorized=1, AutoCast=1, BelongsTo=1,
#    Bloggs=1, CachedKids=1, Centos=1, Compat=1, DBA=1, DELETEs=1, DEV=1,
#    Easysoft=1, Eg=1, FC=1, FileColumn=1, FirstSkip=1, FromForm=1, HRI=1,
#    INSERTs=1, Inflators=1, LimitXY=1, LiveObjectIndex=1, MRO=1,
#    ManyToMany=1, Microsft=1, Multicreate=1, MyApp=1, NULLs=1,
#    NoObjectIndex=1, Northwind=1, Optimizer=1, Overridability=1,
#    POSTGRESQL=1, Queryable=1, RDBMSes=1, RDMS=1, REPL=1, RESULTSET=1,
#    ReLoad=1, Reblessing=1, ResulSet=1, ResultSetColumn=1, Resultset=1,
#    Retrevial=1, RowNum=1, SQLT=1, SchemaVersions=1, Serializable=1,
#    Stringfy=1, Subquery=1, Subselects=1, Suported=1, TMTOWTDI=1,
#    TxnScopeGuard=1, UPDATEs=1, VARCHAR=1, WhereJoin=1, YYYY=1, albumid=1,
#    artifact=1, authorization=1, autodetected=1, autodetection=1,
#    autodetects=1, autoinc=1, autovalidation=1, blabla=1, bugreport=1,
#    caling=1, callframe=1, catalyze=1, cdbi=1, chocolateboy=1, columnname=1,
#    dbh=1, dclone=1, dcloned=1, dcloning=1, de=1, deallocating=1,
#    deferrable=1, dsn=1, explorative=1, failover=1, gR=1, generalized=1,
#    getter=1, gravatar=1, hundres=1, inflator=1, initialize=1, insertdb=1,
#    instantiation=1, introspectible=1, lifecycle=1, localize=1, logsystem=1,
#    materialized=1, mis=1, mysqld=1, nfreeze=1, nls=1, noindexed=1,
#    numification=1, onwards=1, optimized=1, optimizer=1, organize=1,
#    params=1, qw=1, randomizer=1, readbound=1, rebless=1, recognize=1,
#    reconnectable=1, reimplementation=1, ruleset=1, scalarref=1, se=1,
#    serializable=1, serialize=1, serialized=1, serializing=1, specialized=1,
#    sqlt=1, stacktrace=1, standardized=1, suboptimal=1, superset=1, sys=1,
#    sysdate=1, testdb=1, thusly=1, transactionally=1, txn=1, uncommited=1,
#    uninflated=1, uniqueidentifierstr=1, unixODBC=1, unserializable=1,
#    upto=1, utilize=1, versioning=1, webpages=1, xyz=1, AutoCommit=2,
#    BUILDARGS=2, DBs=2, DESC=2, EasySoft=2, FK=2, FilterColumn=2,
#    HashRefInflator=2, Kioku=2, LimitOffset=2, LinerNotes=2, ORMs=2,
#    PREFETCHING=2, Postgres=2, RHS=2, ResultSets=2, SELECTs=2, SkipFirst=2,
#    Subqueries=2, attr=2, autoincremented=2, autoincrementing=2,
#    balancers=2, callsites=2, colname=2, ddl=2, debugcb=2, fastpath=2,
#    freeform=2, incrementing=2, initialized=2, nextval=2, normalize=2,
#    nullable=2, overwritable=2, postgresql=2, prefetching=2, recognizes=2,
#    reconnection=2, resultsource=2, serialization=2, stringifiable=2,
#    syntaxes=2, uninserted=2, varchar=2, DBDs=3, Extensibility=3,
#    FetchFirst=3, InflateColumn=3, Informix=3, JOINs=3, PKs=3, Replicants=3,
#    RowNumberOver=3, TT=3, attrs=3, behaviors=3, natively=3, prefetches=3,
#    preversion=3, recognized=3, subquery=3, subref=3, subselects=3, unary=3,
#    uniqueidentifier=3, DSNs=4, GenericSubQ=4, RHEL=4, Replicant=4, cond=4,
#    debugfh=4, debugobj=4, overridable=4, reblessed=4, subqueries=4,
#    unversioned=4, ORM=5, ResultSources=5, callsite=5, normalizing=5,
#    classdata=6, eg=6, subselect=6, Firebird=7, prefetched=7, savepoints=8,
#    Balancer=9, DDL=9, LongReadLen=10, balancer=10, resultsets=11,
#    savepoint=13, sql=14, behavior=15, unicode=17, ODBC=23, replicants=42,
#    replicant=43, resultset=163

Which indicates resultset is either a good candidate for a stopword or a universal translation =).
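For readers who want to see the idea behind that summary line, here is a minimal sketch in Python. It is not the Perl code from this patch: the function name `flat_summary` and the wrap width of 78 are assumptions made for illustration. It counts each misspelt word, sorts by count ascending (ties in plain string order), and wraps the result as comment lines.

```python
from collections import Counter
import textwrap


def flat_summary(words, width=78):
    """Return comment lines of word=count pairs, least frequent first.

    `words` is every misspelt word reported across all files (hypothetical
    input; the real patch collects these inside Test::Spelling).
    """
    counts = Counter(words)
    # Sort by count ascending, then by word, so the most frequent
    # offenders end up last in the output.
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    body = ", ".join(f"{word}={n}" for word, n in ordered)
    return textwrap.wrap(
        "All wrong words: " + body,
        width=width,
        initial_indent="# ",
        subsequent_indent="#    ",
        break_long_words=False,
        break_on_hyphens=False,
    )


if __name__ == "__main__":
    sample = ["resultset"] * 3 + ["replicant"] * 2 + ["Microsft"]
    print("\n".join(flat_summary(sample)))
```

With the sample input this prints a single line, `# All wrong words: Microsft=1, replicant=2, resultset=3`.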

"All wrong words" seems awkward and forceful, though, and I couldn't think of a better suggestion.

dagolden · 2014-09-26T14:12:27Z

Cool! I'd want to see this in descending order, though.

kentfredric · 2014-09-26T15:13:38Z

@xdg, ascending order was chosen mostly because terminal scroll means the most significant words are more likely to stay "in screen" at the end of a test run, though I'll gladly flip the orientation if it's deemed necessary =)

I could probably make it nicer still by grouping by frequency instead of having a mass of =1.

dagolden · 2014-09-26T15:14:52Z

That's a good point. Most of the time I have only a dozen or so words so it doesn't really matter. So I take it back. I'm neutral on the direction.

kentfredric · 2014-09-26T15:32:58Z

This is the revised format I'm working on. I think it's much easier on the eyes:

# All wrong words:
#      1: ACHTUNG, ASC, Analyzes, AoA, BCP, Backcompat, BegunWork, Bowden,
#         Caelum, ColumnCase, DBICTest, DBICs, DBIHacks, DDL, DESC, DTRT,
#         DWIW, De, FC, FIME, FIXUP, FKs, ForceUTF, GCed, GenSubQ, HELEMs,
#         IC, IFF, Informix, JOINs, LOBs, MOAR, MSAccess, MoreUtils, MsSQL,
#         NULLs, Normalize, ORed, ORing, OpenClient, PKs, POC, RDBMSes,
#         RSC, RaiseError, RestrictWithObject, ResultObject,
#         ResultSetColumn, RowNum, Runmode, STRICTMODE, Savepoints,
#         TEXTSIZE, TxnScopeGuard, WHEREs, WHOREIFFIC, al, aliastype,
#         aliastypes, anchore, artistid, asshats, bcp, bindattrs, bindlist,
#         bindtypes, bindvals, blockrunner, brainer, bugwards, bulkLogin,
#         captainL, centered, classdata, codepaths, codition, colinfos,
#         collapsable, collapsers, collist, confient, crosstable, curcuit,
#         dbi, defeaults, deferrable, deflator, deflators, depping,
#         dequalify, dir, disemvowel, eiehter, equalities, fBSD, fc,
#         firstcol, fixup, fsvo, ftds, fucktards, fugly, fulfill, groditi,
#         hotspot, hve, idcols, iff, implmentation, inambiguous, inflators,
#         initializations, interbase, introspectable, ivsize, jnap,
#         joinpath, joinstructure, jpath, keynames, lhs, libs, libsqlite,
#         localization, memleaks, millisecs, minimize, misbehavior, mkpath,
#         multicol, multiplicator, nonbind, normalized, nullable, objlike,
#         param, parameterize, params, parenthesized, parenthesizing,
#         pessimization, pks, pov, prefetches, preload, preloaded,
#         premulti, premultiplication, premultiplies, pruneable,
#         pseudoforking, qsub, rbuels, reblessed, recursing, reenabling,
#         refactor, refcount, refcounts, reinstall, reinvoke, reinvokes,
#         relname, relnames, relobjs, renderer, replicant, replicants,
#         reselect, reselected, rewriter, ro, rownum, rowparser, rset, rv,
#         sanitize, sanitized, scalarref, scrutinize, sensical,
#         serializing, somesuch, soooooo, stabilize, stackable,
#         stringifiable, subquerying, subselects, subst, sucky, swindon,
#         synchronize, synchronizing, tae, temporarly, tempvars, th,
#         toplevelness, unicode, unparseable, unqualify, unresolvable,
#         utilizing, vpp, vxs, waaaaay, wi, wo, wtf, yay, yyyzzz
#      2: CLOB, CursorType, DBA, FK, Firebird, HRI, JF, LobWriter, SEL, Stil,
#         adOpenStatic, analyze, autoinc, behavior, bindtype, cdbi,
#         colinfo, cref, darkpan, deps, dev, dirtyness, dsn, fixups, fk,
#         fs, hotttnesss, idents, inflator, joinmap, localizing, md, moar,
#         multicolumn, multijoins, normalization, nullability,
#         optimization, overridable, prefetched, rdbms, realiased,
#         recognize, reconnection, recursor, refactored, resultsets,
#         serialized, sqlmaker, subselect, uninserted, varchar, weaklink,
#         weakref
#      3: NoBindVars, ODBC, Postgres, RNO, SQLMaker, braindead, colname,
#         conds, dbh, de, freetds, optimizer, proto, rels, riba, rsrc,
#         scalarrefs, signaled, sqlt, sth, unversioned, wrt
#      4: DBDs, SQLT, attr, ctx, dep, premultiplied, resultsource, savepoint,
#         savepoints, sqla
#      5: DQ, dbic, optimized
#      6: codepath, compat, subq, txn
#      7: SUBOPTIMAL
#      8: LongReadLen, sql, subqueries
#      9: AutoCommit, collapser, cond
#     14: attrs
#     17: SQLA
#     22: resultset
#     26: subquery
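The grouped layout above can be sketched roughly as follows. Again this is an illustrative Python version, not the Perl in the patch: `grouped_summary`, its `header` parameter, and the default width of 76 are assumptions. Words are bucketed by how often they occur, buckets are printed in ascending order of count, and each bucket is wrapped with a hanging indent under its count label.

```python
from collections import Counter, defaultdict
import textwrap


def grouped_summary(words, header="All wrong words:", width=76):
    """Return comment lines grouping misspelt words by occurrence count."""
    counts = Counter(words)

    # Bucket the words by their frequency.
    by_count = defaultdict(list)
    for word, n in counts.items():
        by_count[n].append(word)

    lines = ["# " + header]
    for n in sorted(by_count):  # least frequent group first
        label = f"#{n:>7}: "    # e.g. "#      1: " or "#    163: "
        lines.extend(
            textwrap.wrap(
                ", ".join(sorted(by_count[n])),
                width=width,
                initial_indent=label,
                # Continuation lines align under the first word.
                subsequent_indent="#" + " " * (len(label) - 1),
                break_long_words=False,
                break_on_hyphens=False,
            )
        )
    return lines


if __name__ == "__main__":
    sample = ["subquery"] * 3 + ["attrs"] * 2 + ["yay", "wtf"]
    print("\n".join(grouped_summary(sample)))
```

Grouping removes the repeated `=1` noise from the flat format, and keeping the groups in ascending order preserves the property that the most frequent words stay at the bottom of the terminal.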

dagolden · 2014-09-26T15:38:13Z

How about "All incorrect words, by number of occurrences:"

kentfredric · 2014-09-26T15:55:22Z

Now patched and squashed as follows:

# All incorrect words, by number of occurrences:
#      1: ActorRoles, Authorized, AutoCast, BelongsTo, Bloggs, CachedKids,
#         Centos, Compat, DBA, DELETEs, DEV, Easysoft, Eg, FC, FileColumn,
#         FirstSkip, FromForm, HRI, INSERTs, Inflators, LimitXY,
#         LiveObjectIndex, MRO, ManyToMany, Microsft, Multicreate, MyApp,
#         NULLs, NoObjectIndex, Northwind, Optimizer, Overridability,
#         POSTGRESQL, Queryable, RDBMSes, RDMS, REPL, RESULTSET, ReLoad,
#         Reblessing, ResulSet, ResultSetColumn, Resultset, Retrevial,
#         RowNum, SQLT, SchemaVersions, Serializable, Stringfy, Subquery,
#         Subselects, Suported, TMTOWTDI, TxnScopeGuard, UPDATEs, VARCHAR,
#         WhereJoin, YYYY, albumid, artifact, authorization, autodetected,
#         autodetection, autodetects, autoinc, autovalidation, blabla,
#         bugreport, caling, callframe, catalyze, cdbi, chocolateboy,
#         columnname, dbh, dclone, dcloned, dcloning, de, deallocating,
#         deferrable, dsn, explorative, failover, gR, generalized, getter,
#         gravatar, hundres, inflator, initialize, insertdb, instantiation,
#         introspectible, lifecycle, localize, logsystem, materialized,
#         mis, mysqld, nfreeze, nls, noindexed, numification, onwards,
#         optimized, optimizer, organize, params, qw, randomizer,
#         readbound, rebless, recognize, reconnectable, reimplementation,
#         ruleset, scalarref, se, serializable, serialize, serialized,
#         serializing, specialized, sqlt, stacktrace, standardized,
#         suboptimal, superset, sys, sysdate, testdb, thusly,
#         transactionally, txn, uncommited, uninflated,
#         uniqueidentifierstr, unixODBC, unserializable, upto, utilize,
#         versioning, webpages, xyz
#      2: AutoCommit, BUILDARGS, DBs, DESC, EasySoft, FK, FilterColumn,
#         HashRefInflator, Kioku, LimitOffset, LinerNotes, ORMs,
#         PREFETCHING, Postgres, RHS, ResultSets, SELECTs, SkipFirst,
#         Subqueries, attr, autoincremented, autoincrementing, balancers,
#         callsites, colname, ddl, debugcb, fastpath, freeform,
#         incrementing, initialized, nextval, normalize, nullable,
#         overwritable, postgresql, prefetching, recognizes, reconnection,
#         resultsource, serialization, stringifiable, syntaxes, uninserted,
#         varchar
#      3: DBDs, Extensibility, FetchFirst, InflateColumn, Informix, JOINs,
#         PKs, Replicants, RowNumberOver, TT, attrs, behaviors, natively,
#         prefetches, preversion, recognized, subquery, subref, subselects,
#         unary, uniqueidentifier
#      4: DSNs, GenericSubQ, RHEL, Replicant, cond, debugfh, debugobj,
#         overridable, reblessed, subqueries, unversioned
#      5: ORM, ResultSources, callsite, normalizing
#      6: classdata, eg, subselect
#      7: Firebird, prefetched
#      8: savepoints
#      9: Balancer, DDL
#     10: LongReadLen, balancer
#     11: resultsets
#     13: savepoint
#     14: sql
#     15: behavior
#     17: unicode
#     23: ODBC
#     42: replicants
#     43: replicant
#    163: resultset

(NB: The lists of words differ because I'm prototyping my changes on a non-test-spelling codebase, so there's no need for concern. The list with "resultset" as the most common word comes from Test-Spelling's Pod checks, and the one with "subquery" is extracted from the comments: https://github.com/kentfredric/dbix-class/blob/topic/spelling.t/xt/commentspelling.t#L57 )

sartak · 2014-09-26T16:35:54Z

Awesome. Thank you @kentfredric! I will ship this in a few days when I return from my trip.

sartak · 2014-10-07T13:49:01Z

POSTing upload for Test-Spelling-0.20.tar.gz to https://pause.perl.org/pause/authenquery

kentfredric force-pushed the master branch from 642bc9a to 655b00d · September 26, 2014 15:50

sartak added a commit that referenced this pull request Sep 26, 2014

Merge pull request #7 from kentfredric/master

e9a2a7b

Add a sorted wordcount to the bottom of all_pod_files_spelling_ok

sartak merged commit e9a2a7b into genio:master Sep 26, 2014
