# D&D Dice Fraud Detection Software

## 0. Global Imports

Please keep this area tidy. Import only what you need, and put heavy libraries such as sklearn / TensorFlow towards the end so that we do not run into environment problems too early.

In [27]:
import numpy as np
import pandas as pd

## 1. Generate Real Data

Real data is generated by the formula printed in the book, 4d6dl1: roll four six-sided dice (random integers from 1 to 6 inclusive), drop the lowest and sum the remaining three. Each sample is a set of six such totals.

In [30]:
N_real=50000
N_fake=10000

In [33]:
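def FourSixSidedDiceDropLowest():
    #Roll 4d6 (randint's upper bound is exclusive), sort ascending and drop the lowest die
    randnums= np.random.randint(1,7,4)
    return np.sort(randnums)[1:].sum()

#One row per stat set: six totals each, N_real sets in total
realSample=np.array([[FourSixSidedDiceDropLowest() for _ in range(6)] for _ in range(N_real)])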
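
As a sanity check on the sampler, the exact 4d6dl1 distribution can be enumerated, since there are only 6**4 = 1296 possible rolls. The sketch below is an illustration rather than part of the pipeline; the function name `exact_4d6dl1_distribution` is an assumption introduced here.

```python
from fractions import Fraction
from itertools import product

def exact_4d6dl1_distribution():
    """Return {total: probability} for 4d6 drop lowest by enumerating every roll."""
    counts = {}
    for roll in product(range(1, 7), repeat=4):
        total = sum(roll) - min(roll)  # dropping the lowest die
        counts[total] = counts.get(total, 0) + 1
    n_rolls = 6 ** 4
    return {total: Fraction(count, n_rolls) for total, count in sorted(counts.items())}

dist = exact_4d6dl1_distribution()
exact_mean = sum(total * p for total, p in dist.items())
print(float(exact_mean))     # should sit close to the sample means of the real data
print(min(dist), max(dist))  # support of the distribution
```

Comparing these exact values with the sample statistics of the generated real data is a cheap way to catch an off-by-one in the dice bounds.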

## 2. Generate different types of faked data

We must use a great variety of formulae or algorithms in order to present faked data.
Main suggestions:
* More dice, more dropped: 5d6dl2 raises the average while keeping the upper bound at 18. At the extreme, (x+1)d20dlx (keep only the best of several d20) makes the cheat very obvious
* Smaller dice: 6d4dl2 has a lower upper bound (16) than 4d6dl1 (18), with an average close to that of 4d6dl1 (about 12.2 in both cases)
* Drop fewer dice: more extreme outliers can be generated by removing fewer dice. This can push the upper bound beyond 18, which is impossible under 4d6dl1, but the algorithm needs to see such samples in order to learn that.
* Roll the dice set multiple times and keep the "best": stat sets with a high average and low variance are infrequent but manifestly better. An example "goodness indicator" is mu / (5 + sigma), where mu and sigma are the mean and standard deviation of the six stats
* For good measure, "handicap" sets should also be included, so that the algorithm does not automatically equate a statistical upper outlier with a cheat
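
To make the suggestions above concrete, here is a minimal sketch of the bounds of an XdYdlZ formula and of the goodness indicator mu / (5 + sigma). The helper names `kept_bounds` and `goodness` are assumptions for illustration, and sigma is taken to be the population standard deviation (which is also what `np.std` computes by default).

```python
from statistics import mean, pstdev

def kept_bounds(number, size, drop):
    """Smallest and largest total of an XdYdlZ roll: (number - drop) dice are kept."""
    kept = number - drop
    return kept, kept * size

def goodness(stats):
    """Goodness indicator mu / (5 + sigma) for one set of stats."""
    return mean(stats) / (5 + pstdev(stats))

for scheme in [(4, 6, 1), (5, 6, 2), (6, 4, 2), (1, 20, 0)]:
    print(scheme, kept_bounds(*scheme))

# Same average, different spread: the flat set scores higher.
print(goodness([12, 12, 12, 12, 12, 12]))
print(goodness([6, 18, 6, 18, 6, 18]))
```

The constant 5 in the denominator keeps the indicator finite for a set with zero spread, so six identical stats of 18 score 18 / 5.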

In [26]:
#This function implements any arbitrary XdYdlZ formula. All dice are assumed identical.
def numbersizedrop(number,size,drop):
    randnums= np.random.randint(1,size+1,number)
    #Sort ascending and slice off the `drop` lowest dice (also works for drop=0)
    return np.sort(randnums)[drop:].sum()

#This implements the "get lucky" algorithm whereby we keep rolling until
#we find a good statistical outlier with high average and low variance.
#NB: the coefficient carries a factor of 2 relative to mu / (5 + sigma), so its largest
#possible value is 2*(number-drop)*size/5 (7.2 for 4d6dl1). A larger targetcoeff never terminates.
def keeprolling(number,size,drop,targetcoeff):
    coeff=0
    stats=None #stays None if targetcoeff <= 0, since the loop then never runs
    step=0
    while coeff < targetcoeff:
        tstats= np.array([numbersizedrop(number,size,drop) for _ in range(6)])
        step+=1
        tcoeff=2*np.mean(tstats) / (5+ np.std(tstats))
        if coeff < tcoeff:
            coeff=tcoeff
            stats=tstats
            #print('New candidate at step {2}: coeff. {0} with stats {1}'.format(coeff, stats, step))
    #print('Final candidate at step {2}, coeff {0} with stats {1}'.format(coeff, stats, step))
    return stats

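
Calling `numbersizedrop` once per stat means hundreds of thousands of Python-level calls. A vectorised alternative is sketched below, assuming NumPy's `Generator` API is available; the function name `roll_stat_sets` and its parameters are hypothetical, not part of the notebook.

```python
import numpy as np

def roll_stat_sets(n_sets, number, size, drop, n_stats=6, rng=None):
    """Roll n_sets sets of n_stats totals using an XdYdlZ formula, all at once."""
    rng = np.random.default_rng() if rng is None else rng
    # One die per entry: shape (n_sets, n_stats, number); the upper bound is exclusive.
    rolls = rng.integers(1, size + 1, size=(n_sets, n_stats, number))
    rolls.sort(axis=-1)                     # ascending within each roll
    return rolls[..., drop:].sum(axis=-1)   # drop the lowest dice, sum the rest

sample = roll_stat_sets(1000, number=4, size=6, drop=1, rng=np.random.default_rng(0))
print(sample.shape)
print(sample.min(), sample.max())
```

Passing an explicit seeded generator makes a data set reproducible, which helps when comparing runs.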
In [73]:
#Each (number, size, drop) tuple below is one generation method, i.e. one XdYdlZ formula.
#Adding a new generation method is then simply a case of adding another tuple.
#The result has shape (number of methods, N_fake, 6), so we can enumerate over methods later.
fakeMethods=[
    (5,6,2),   #5d6dl2: totals from 3 to 18
    (6,4,2),   #6d4dl2: totals from 4 to 16
    (6,20,5),  #6d20dl5: best of six d20
    (3,20,2),  #3d20dl2: best of three d20
    (4,6,2),   #4d6dl2: only two dice kept, totals from 2 to 12
    (1,20,0),  #1d20: totals from 1 to 20
    (4,4,0),   #4d4, nothing dropped: totals from 4 to 16
    (2,8,0),   #2d8, nothing dropped: totals from 2 to 16
]

fakeSample=np.array([
    [[numbersizedrop(number,size,drop) for _ in range(6)] for _ in range(N_fake)]
    for number,size,drop in fakeMethods
])

In [74]:
np.shape(fakeSample)[0]

8

## 3. Assemble data in Pandas DataFrames

A Pandas DataFrame object is most naturally suited for statistical analysis and regression. Note that we should keep the Real and Faked data separate until the end of preprocessing in order to balance the data.

In [35]:
realDataRaw=pd.DataFrame(realSample, columns=['1','2','3','4','5','6'])
realDataRaw.describe()
#describe() explores the statistics of the DataFrame, very useful to compare the various generation methods

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 12.232 | 12.25114 | 12.24936 | 12.25624 | 12.2455 | 12.24856 |
| std | 2.844169 | 2.839104 | 2.835832 | 2.838926 | 2.8485 | 2.842059 |
| min | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| 25% | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| 50% | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |
| 75% | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 |
| max | 18.0 | 18.0 | 18.0 | 18.0 | 18.0 | 18.0 |


In [82]:
fakeDataLists=[
    pd.DataFrame(fakeSample[i], columns=['1','2','3','4','5','6']) for i in range(np.shape(fakeSample)[0])
]
#DataFrames are necessarily two-dimensional, so to start with we make one frame per method for fake data. 
#This allows us to explore the statistics of the fake data before fusing it all together.

In [85]:
fakeDataLists[0].describe()

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 13.4451 | 13.3879 | 13.4057 | 13.4517 | 13.397 | 13.4046 |
| std | 2.580243 | 2.593628 | 2.631083 | 2.613035 | 2.606505 | 2.606104 |
| min | 4.0 | 3.0 | 3.0 | 4.0 | 3.0 | 3.0 |
| 25% | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 | 12.0 |
| 50% | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 |
| 75% | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| max | 18.0 | 18.0 | 18.0 | 18.0 | 18.0 | 18.0 |


In [107]:
fakeDataRaw=pd.concat(
    fakeDataLists,
    #keys=[i+1 for i in range(np.shape(fakeSample)[0])],
    ignore_index=True
)
#pd.concat takes an optional list of keys to build a hierarchical row index, with one outer label per generation method.
#Alternatively, ignore_index=True flattens all the data into a single 0..n-1 index, which is better suited for processing.
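
The difference between the two `pd.concat` options is easiest to see on toy frames. The sketch below uses made-up values; the level names `method` and `row` are assumptions chosen for illustration.

```python
import pandas as pd

frames = [
    pd.DataFrame({'1': [10, 11], '2': [12, 13]}),
    pd.DataFrame({'1': [14, 15], '2': [16, 17]}),
]

# keys= adds an outer index level recording which frame each row came from.
keyed = pd.concat(frames, keys=[1, 2], names=['method', 'row'])
# ignore_index=True discards the original indices and renumbers the rows.
flat = pd.concat(frames, ignore_index=True)

print(keyed.loc[2])          # rows that came from the second frame
print(flat.index.tolist())
```

The keyed form keeps the generation method recoverable, for example via `keyed.reset_index(level='method')`, while the flat form forgets it.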

Now would be a good time to save the raw data to a file.

In [109]:
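#index=False leaves out the row index, which would otherwise come back as an 'Unnamed: 0' column on reload
realDataRaw.to_csv('realDataRaw.csv', index=False)
fakeDataRaw.to_csv('fakeDataRaw.csv', index=False)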

## 4. Preprocess data

We must perform a variety of operations on the data before it is analysed by the ML algorithm.
* Feature scaling: subtract the mean and divide by the standard deviation so that all features are on an even footing
* Randomising: shuffle the rows inside the DataFrames so that their ordering carries no information (the fake data is otherwise grouped by generation method)
* Balancing: ideally the ML algorithm sees about as many data points in each category when sorting categorical or binary data, which in this case means aiming for a 50% true/false split. It is easiest to do this here: since we must come up with many different ways of generating fake data, it would be tedious to control up front how much fake data we generate.
* Merge: both sets need to end up in one DataFrame, which we shuffle once more
* Train/Validate/Test split: neural networks have a natural tendency to overfit the data we give them, so part of the data is reserved for validation and testing. Fitting stops when the validation loss stops decreasing.
* Batching: for increased speed, update the coefficients based on only a small batch of the training data at each step

### Scaling
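
A minimal sketch of the scaling step, using plain pandas rather than a library scaler. The function name `standardise` and the toy values are assumptions. The mean and standard deviation are returned so that statistics fitted on one frame can be reused on another.

```python
import numpy as np
import pandas as pd

def standardise(frame, mean=None, std=None):
    """Subtract the column means and divide by the column standard deviations."""
    mean = frame.mean() if mean is None else mean
    std = frame.std(ddof=0) if std is None else std  # population std, as np.std
    return (frame - mean) / std, mean, std

toy = pd.DataFrame({'1': [8, 12, 16], '2': [10, 12, 14]})
scaled, mu, sigma = standardise(toy)
print(scaled)

# Reuse the fitted statistics on unseen rows instead of refitting.
new_rows = pd.DataFrame({'1': [12], '2': [12]})
new_scaled, _, _ = standardise(new_rows, mean=mu, std=sigma)
print(new_scaled)
```

One design choice worth considering is to fit the statistics on the training portion only and apply them unchanged to the validation and test portions, so that no information leaks from the held-out data.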

### Randomise
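
A minimal sketch of the shuffling step on a toy frame. `DataFrame.sample(frac=1)` returns all rows in random order; the seed value is an arbitrary assumption made for reproducibility.

```python
import pandas as pd

toy = pd.DataFrame({'1': range(10), '2': range(10, 20)})

# frac=1 samples every row exactly once, i.e. a permutation of the rows.
# reset_index(drop=True) throws away the old row labels so the order is not recoverable.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```

Whole rows are moved together, so the six stats of one set stay in the same row.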

### Balancing
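
One simple way to reach a 50% split is to downsample the larger class to the size of the smaller one. The sketch below assumes that approach; the function name `balance` and the toy frames are hypothetical.

```python
import pandas as pd

def balance(real, fake, seed=0):
    """Downsample the larger frame so that both have the same number of rows."""
    n = min(len(real), len(fake))
    return (
        real.sample(n=n, random_state=seed),
        fake.sample(n=n, random_state=seed),
    )

toy_real = pd.DataFrame({'1': range(5)})
toy_fake = pd.DataFrame({'1': range(100, 108)})
real_bal, fake_bal = balance(toy_real, toy_fake)
print(len(real_bal), len(fake_bal))
```

Plain random downsampling does not guarantee that every fake generation method stays equally represented; sampling a fixed number of rows per method would be an alternative.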

### Merge
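
A minimal sketch of the merge step: label each frame, concatenate and shuffle once more. The label column name `is_fake` and the function name `merge_labelled` are assumptions introduced for illustration.

```python
import pandas as pd

def merge_labelled(real, fake, seed=0):
    """Concatenate real and fake rows with a binary label, then shuffle."""
    merged = pd.concat(
        [real.assign(is_fake=0), fake.assign(is_fake=1)],
        ignore_index=True,
    )
    return merged.sample(frac=1, random_state=seed).reset_index(drop=True)

toy_real = pd.DataFrame({'1': [12, 13, 11], '2': [10, 14, 12]})
toy_fake = pd.DataFrame({'1': [18, 17, 20], '2': [16, 18, 19]})
merged = merge_labelled(toy_real, toy_fake)
print(merged)
```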

### Train/Validate/Test split
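
A minimal sketch of a three-way split using a random permutation of the row positions. The function name `train_val_test_split` and the fractions are assumptions, not values fixed by this notebook.

```python
import numpy as np
import pandas as pd

def train_val_test_split(frame, val_frac=0.15, test_frac=0.15, seed=0):
    """Split the rows of frame into disjoint train, validation and test frames."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))  # random order of row positions
    n_test = int(len(frame) * test_frac)
    n_val = int(len(frame) * val_frac)
    test = frame.iloc[order[:n_test]]
    val = frame.iloc[order[n_test:n_test + n_val]]
    train = frame.iloc[order[n_test + n_val:]]
    return train, val, test

toy = pd.DataFrame({'1': range(20), 'is_fake': [0, 1] * 10})
train, val, test = train_val_test_split(toy, val_frac=0.25, test_frac=0.25)
print(len(train), len(val), len(test))
```

The test portion should be set aside and left untouched until the model and its settings are final.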

Now would also be a good time to save the preprocessed data.

## 5. Define Model

## 6. Train and Validate Model

**DO NOT**, I repeat, ***DO NOT*** fiddle with the hyperparameters to improve test data accuracy **WITHOUT** generating a new set of data samples. Every such tweak leaks information from the test set into the model, and you WILL overfit.