Skip to content

CityPay's PAN search is a Scala/Java sensitive data discovery tool for searching locations for the existence of sensitive primary cardholder data.

License

Notifications You must be signed in to change notification settings

citypay/citypay-pan-search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CityPay PCI PAN Search

Build Status

CityPay's PAN search is a Scala/Java tool for searching locations for the existence of sensitive primary cardholder data.

The scanner is able to scan multiple sources such as

  • Multiple Operating Systems including Microsoft ® Windows 32/64, Mac OSX 32/64, Linux 32/64, Arm
  • Local Filesystems
  • Databases (via JDBC drivers such as MySql, SQLServer, Oracle etc)
  • WebSites (Future Planned)
  • Archives, catering for compression streams such as 7-Zip, Zip, Rar, Tar, Split, Lzma, Iso, HFS, GZip, Cpio, BZip2, Z, Arj, Chm, Lhz, Cab, Nsis, Deb, Rpm

Compliance Information

Note that all card holder data including PANs within unit tests supplied with this source code are test data and generated with a valid luhn checksum.

Pre-requisites

To run the search tool you will need to ensure a valid Java 8 runtime is installed on the system.

Installation and Getting Started

The scanner can create a deployment package by using sbt-native-packager and running one of the following

  • Generate a deployment zip sbt universal:packageBin
  • Generate a deployment tarball sbt universal:packageZipTarball
  • Generate a deployment Xz tarball sbt universal:packageXzTarball

To run the scanner, extract the deployment archive to a directory (CP_SEARCH_ROOT). The directories should appear

  1. bin contains binary files bin/citypay-pan-search and bin/citypay-pan-search.bat for executing the bin search
  2. conf contains configuration files as listed below
  3. lib contains library files used by the pan search process

Once running, execute the binary file bin/citypay-pan-search and you will be presented with a console progress bar scanner-svc-0|========================> | A: 61400 | C: 61400 | M: 1520 | 0100%

Configuration Options

Configuration is found in 3 core logging files search.conf, chd.conf and scanner.conf

Search Configuration

Parameter Data Type Description
search object holder for search configuration
search.source object holder for source configurations
type string The type of search based on ScanSourceConfigFactory such as file, db

File Searching

Parameter Data Type Description
type string "file"
root string[] array of root directories to search
includeHiddenFiles boolean whether hidden files are scanned. Hidden files are either marked as hidden by the file system or start with "."
recursive boolean whether to recursively scan a root directory. Default value is true
maxDepth int the maximum number of directories to scan into

Database Searching

Database searching requires a valid driver which must be provided in the classpath

Parameter Data Type Description
type string "db"
driver string The class name of the jdbc driver to use i.e. "com.mysql.jdbc.Driver"
url string The connection url string to use to provide a connection to the database, Consult the driver documentation for details of examples and configuration of connection strings
schema string The database schema to connect to
tableNameRegex string A regular expression of the table names to scan. This will be dependant on your database setup
colNameRegex string A regular expression of column names to scan
credentials string A class implementation of a com.citypay.pan.search.util.CredentialsCallback preconfigured examples are com.citypay.pan.search.util.NoOpCredentials which does not provide credentials that may be previously supplied in the connection url. com.citypay.pan.search.util.CommandLineCredentials provide a callback function to the command line to enter the username and password at the time of the scan.

Scanner Configuration

Parameter Data Type Description
analaysisBufferSz int The size of the buffer to use when reading from a file, defaults to 16384
concurrentScans int The number of concurrent scan processes to run. A scan process is a search configuration such as file system search, database search etc
scanArchives boolean Whether to scan archive files by expanding them for analysis in RAM or temp files.
scanWorkers int The number of concurrent file scans to perform within a concurrent scan.
stopOnFirstMatch boolean If a value is found the search will exit the file, reporting that an item was found. Any further items will not be listed. If the value is false, the entire file will be searched and all items reported
renderer string A renderer to produce a report by, defaults to "json"
prettify boolean A boolean value whether the renderer should pretty print which is easier for human reading

Card Holder Data Configuration

Parameter Data Type Description
chd.level1 chd object Contains an array of chd specs for analysis at a level 1 stage. Should be short digits for fast searching. Recommended as 1 or 2 digits
chd.level2 chd object Contains an array of chd specs for analysis at a level 2 stage. Should be longer than level1 by 1 or more positions. The more accurate these values, the less false positives are found
chd.falsePositives string array Contains an array of false positive strings which are checked at the point that a match is found. Examples are 2222222222222222

CHD Specs

Parameter Data Type Description
name string The name of the card scheme spec
id string An id of the spec to identify by, will group via the id when testing for schemes
logo string The name of a logo file associated with the scheme
len string The length of the scheme provided as a single value of a range of values specified with a dash such as 16 or 16-19
bins int[] An array of ints that form the leading values for analysis

Scanning Algorithm

The scanner uses an algorithm to assess whether the file contains sensitive data with the purpose to scan against multiple character sets and encrypted files.

Character Definition

  • Scan characters are defined as numeric groups (N) [\u0030-\u0039] with delimiters of [\u0020\u002D\u0009] being space, comma and tab. In PAN storage, this would find examples such as 4000000000000002, 4000-0000-0000-0002, 4000 0000 0000 0002. A combination of both characters are labelled as D

Input Stream analysis

Data is collated and analysed from an input stream buffer of analysisBufferSz bytes. If a N digit is found it is appended to the inspection buffer and its index position marked. If the next character is "out of bounds" being a non D character, the stream is reset. Should the next character match it is appended and the overall length of the inspection buffer vs the minimum pan length is checked. If we are in an inspection range, we move to an inspection process. This manner of providing multiple buffers aids in preventing over analysis of data.

PAN Inspection

Pan inspection analyses the incoming bytes and collates bytes based on matching characters in D until a non D character occurs or a buffer limit is obtained. In a data file it is expected that delimiters such as ',\n' etc will divide items accordingly, to prevent overflows a buffer limit is included.

Once enough data is buffered for analysis, we use a modified version of the Knuth Morris Pratt algorithm (KMP).

An example is to run the algorithm as W = "45NNNNNNNNNNNNNN" where N and S = "4514500000000000000000012325", N is a numeric value with a length of 16.

  • m, denoting the position within S where the prospective match for W begins,
  • i, denoting the index of the currently considered character in W

In each step the algorithm compares S[m+i] with W[i] and increments i if they are equal. This is illustrated at the start, such as:

             1         2            3         4
m: 0123456789012345678901201234567890123456789012
S: 4514500000000000000000012325
W: 45NNNNNNNNNNNNNN
i: 0123456789012345
                  L

At the point that the required number of digits are found, a luhn check is performed to confirm that it is indeed a card number. Should the luhn check fail, the algorithm either continues (if there is a range in length i.e. 16-19)

             1         2            3         4
m: 0123456789012345678901201234567890123456789012
S: 4514500000000000000000012325
W: 45NNNNNNNNNNNNNNNNN
i: 0123456789012345678
                   LLL

If no match is found, it searches from the next instance, where m = 3.

             1         2            3         4
m: 0123456789012345678901201234567890123456789012
S: 4514500000000000000000012325
W:    45NNNNNNNNNNNNNN
i:    0123456789012345
                     L

For further information see

  1. https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
  2. http://algs4.cs.princeton.edu/53substring/KMP.java.html

Troubleshooting

The scanner provides full trace logs for analysis of its processes and can be enabled by providing a java system property of pan.search.trace when running the scan. It can be run in multiple combinations set out in the table below, delimited with a comma or by setting the value to all e.g. java com.citypay.pan.sesarch.Scanner -Dpan.search.trace=all Any trace logging will be sent to System.out by default.

Key Description
analysisBuffer trace logging for the analysis buffer which is run when searching the buffer for card data. The analysis buffer is run on level 1 pans only
analysisBufferL2 trace logging for the abalysis buffer which is run on a level 2 search, after level 1 has found results.
db trace logging for database column searches
inspectionScanner used on the inspection buffer to analysis the inspection routines

Reporting

The scanner outputs a JSON file containing the results of the scan which can be loaded via HTML to construct working reports to meet compliance.

About

CityPay's PAN search is a Scala/Java sensitive data discovery tool for searching locations for the existence of sensitive primary cardholder data.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages