Here are some basic experiments to check in which scenarios low-rank approximation works well for error detection and in which it fails. The initial intuition is that it may work well in linear cases but fail in nonlinear ones.

Let's import some basic libraries first.

In [111]:
import numpy as np
import pandas as pd
import random

def low_rank_approximation(data, k):
    '''
    A simple function to compute low_rank approximation with rank specified by the second parameter
    '''
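    # np.linalg.svd returns V transposed as its third output,
    # so the rows of vt are the right singular vectors
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    low_rank_diag = np.diag(s[:k])
    low_rank_u = u[:, :k]
    low_rank_vt = vt[:k, :]
    
    low_rank_approx = np.matmul(np.matmul(low_rank_u, low_rank_diag), low_rank_vt)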
    
    return low_rank_approx

def show_difference(data, k):
    '''
    A simple function to compute the difference between original matrix and its low rank approximation. 
    For the sake of visualization, returns a pd.DataFrame
    '''
    approx = low_rank_approximation(data, k)

    diff_matrix = abs(data - approx)
    df = pd.DataFrame(diff_matrix)
    return df
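
Before moving on, here is a small sanity check of the truncated-SVD construction. It is only a sketch: `rank_k_approx` is a hypothetical stand-in that mirrors `low_rank_approximation`, and the test matrices are made up for illustration. It checks two standard facts: an exactly rank-1 matrix is reproduced by its rank-1 approximation, and the Frobenius norm of the residual equals the norm of the discarded singular values.

```python
import numpy as np

def rank_k_approx(data, k):
    # Same truncated-SVD construction as low_rank_approximation
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# A matrix whose second column is an exact multiple of the first has rank 1,
# so its rank-1 approximation should match it up to rounding error.
x = np.arange(1.0, 11.0)
exact_rank_one = np.column_stack((x, 3 * x))
max_residual = np.abs(exact_rank_one - rank_k_approx(exact_rank_one, 1)).max()
print('Largest residual for an exactly rank-1 matrix:', max_residual)

# For a general matrix, the Frobenius norm of the rank-k residual
# equals the norm of the singular values that were dropped.
rng = np.random.default_rng(0)  # fixed seed, illustrative data only
m = rng.normal(size=(20, 4))
singular_values = np.linalg.svd(m, compute_uv=False)
residual_norm = np.linalg.norm(m - rank_k_approx(m, 2))
discarded_norm = np.sqrt(np.sum(singular_values[2:] ** 2))
print('Residual norm:', residual_norm, 'Discarded norm:', discarded_norm)
```

If both checks hold, any large residual in the experiments comes from the data rather than from the helper.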
    

Now, let's start with a simple base column containing the integers from 0 to 9. Against this base column, we test the behavior of low-rank approximation with some other columns, which are specified below.

In [112]:
# base column
base_col = np.array(range(0, 10))

# testing column: addition
add_col = base_col + 20
# Put an erroneous entry in add_col
loc = random.randint(0, 9)
add_col[loc] = 1000
print('The value at %d is set to be erroneous' % loc)

test_data1 = np.column_stack((base_col, add_col))
result = show_difference(test_data1, 1) # k is set to 1 simply because here we only have 2 columns
print(result)

The value at 2 is set to be erroneous
          0         1
0  0.062487  0.000195
1  0.934379  0.002919
2  1.124349  0.003513
3  2.928111  0.009148
4  3.924977  0.012263
5  4.921843  0.015378
6  5.918709  0.018492
7  6.915575  0.021607
8  7.912441  0.024721
9  8.909307  0.027836


From this simple experiment, we can see that at the corrupted location the residuals are almost indistinguishable from those of the other rows in both columns, so the error does not stand out in this small setting. Now, let's try to increase the size of the data.

In [113]:
# base column
base_col = np.array(range(0, 50))

# testing column: addition
add_col = base_col + 20
# Put an erroneous entry in add_col
loc = random.randint(0, 49)  # randint is inclusive on both ends; 49 is the last valid index
add_col[loc] = 1000
print('The value at %d is set to be erroneous' % loc)

test_data1 = np.column_stack((base_col, add_col))
result = show_difference(test_data1, 1) # k is set to 1 simply because here we only have 2 columns
print(result)

The value at 42 is set to be erroneous
            0         1
0    1.922263  0.186494
1    1.027700  0.099705
2    0.133138  0.012917
3    0.761424  0.073872
4    1.655986  0.160660
5    2.550548  0.247449
6    3.445110  0.334237
7    4.339673  0.421025
8    5.234235  0.507814
9    6.128797  0.594602
10   7.023359  0.681391
11   7.917921  0.768179
12   8.812484  0.854968
13   9.707046  0.941756
14  10.601608  1.028545
15  11.496170  1.115333
16  12.390732  1.202121
17  13.285294  1.288910
18  14.179857  1.375698
19  15.074419  1.462487
20  15.968981  1.549275
21  16.863543  1.636064
22  17.758105  1.722852
23  18.652668  1.809641
24  19.547230  1.896429
25  20.441792  1.983218
26  21.336354  2.070006
27  22.230916  2.156794
28  23.125479  2.243583
29  24.020041  2.330371
30  24.914603  2.417160
31  25.809165  2.503948
32  26.703727  2.590737
33  27.598289  2.677525
34  28.492852  2.764314
35  29.387414  2.851102
36  30.281976  2.937890
37  31.176538  3.024679
38  32.071100  3.111467
...

Now the residual at the corrupted location differs dramatically from the residuals of the other entries. Next, let's add a noise column to the matrix.

In [114]:
# Create a noise column with random numbers and add it to matrix
noise_col = np.array(random.sample(range(50), 50))
test_data2 = np.column_stack((test_data1, noise_col))

# Now, redo the low rank approximation
# We try two values for k because here we have three columns
result_one = show_difference(test_data2, 1)
result_two = show_difference(test_data2, 2)
print('Here is the result with k = 1')
print(result_one)
print('Here is the result with k = 2')
print(result_two)

Here is the result with k = 1
            0         1          2
0    2.180812  2.133357  28.153924
1    1.157942  0.901244  12.173284
2    0.248239  0.817688  10.096846
3    0.354293  3.851644  45.760386
4    1.441830  1.963213  21.834487
5    2.303032  2.371895  25.716993
6    3.390570  0.483464   1.791094
7    3.952685  3.927619  42.420421
8    4.878554  3.679983  38.357669
9    5.974175  1.709513  13.438613
10   6.689875  3.594911  35.197950
11   7.534910  4.167674  41.066771
12   8.388028  4.658396  45.942434
13   9.330064  4.246681  39.893367
14  10.296350  3.588847  30.864828
15  11.278803  2.766933  19.849975
16  12.253173  2.027059   9.828278
17  13.251793  1.041065   3.172890
18  14.088744  1.695867   3.689088
19  14.820611  3.417187  23.462111
20  15.786897  2.759352  14.433572
21  16.543015  4.234552  31.227123
22  17.347632  5.217513  42.061730
23  18.524088  2.426645   7.211101
24  19.498457  1.686771   2.810595
25  20.197991  3.736249  20.935057
26  20.913691  5.621648  

Here, we can see that the choice of k actually matters. When k is 1, we can still observe the difference at the corrupted location. When k=2, however, we can barely see it. This gives us one direction of work: __choice of k__.
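
One common heuristic for choosing k, sketched here as an assumption rather than something this notebook has validated, is to look at the singular values and the cumulative share of squared singular values ("energy") they capture. The data below is a seeded re-creation of the three-column setup; the seed, `loc = 42`, and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 50
base = np.arange(n, dtype=float)
add = base + 20
loc = 42                      # hypothetical error position
add[loc] = 1000.0             # injected error
noise = rng.permutation(n).astype(float)
data = np.column_stack((base, add, noise))

# Singular values come back sorted from largest to smallest
singular_values = np.linalg.svd(data, compute_uv=False)

# Share of the total squared singular values captured by the first k components
energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)

for k, (sv, share) in enumerate(zip(singular_values, energy), start=1):
    print(f'k={k}: singular value={sv:.2f}, cumulative energy={share:.4f}')
```

A sharp drop between consecutive singular values suggests a candidate rank, but whether that rank is also good for error detection still has to be checked against the residuals.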

Now, let's run one more experiment with noise.

In [115]:
noise_col = np.array(random.sample(range(50), 50))
test_data3 = np.column_stack((test_data2, noise_col))

# Now, redo the low rank approximation
# We try two values for k; the matrix now has four columns
result_one = show_difference(test_data3, 1)
result_two = show_difference(test_data3, 2)
print('Here is the result with k = 1')
print(result_one)
print('Here is the result with k = 2')
print(result_two)

Here is the result with k = 1
            0          1          2          3
0    2.457681   4.468682  27.909128  24.690239
1    1.200612   0.909308  12.127829   0.068165
2    0.319300   1.090959  10.026856   2.820292
3    0.161077   5.264329  45.584786  14.331943
4    0.986175   6.005651  21.435989  43.167568
5    2.186847   3.007769  25.606710   6.356161
6    3.058137   3.289203   1.497210  30.235198
7    3.561368   7.235036  42.074584  34.768328
8    4.570267   6.146435  38.082155  25.776692
9    5.625450   4.597036  13.129102  30.828554
10   6.509298   4.753448  35.030285  11.719392
11   7.416091   4.681413  40.950990   4.631795
12   7.906281   8.757081  45.517269  43.152669
13   9.153628   5.294498  39.727699  10.385129
14  10.047289   5.353209  30.637231  18.285190
15  10.948437   5.337375  19.553133  27.192288
16  12.062944   3.197343   9.650550  12.299903
17  13.147110   1.359392   3.277846   3.379003
18  14.008587   1.738528   3.604305   0.248817
19  14.373334   7.063113  23.0

Again, we can see a huge difference at the corrupted location only when k=1. However, we cannot conclude that k=1 is always the best choice. More experiments on the choice of k, with more complex data, are needed in the future.
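
A first step toward such experiments could be a sweep over k that compares the residual at the known error cell with the typical residual elsewhere in the same column. This is a minimal sketch on seeded data that re-creates the four-column setup; `residual`, the seed, and `loc = 42` are illustrative assumptions.

```python
import numpy as np

def residual(data, k):
    # Absolute difference between the data and its rank-k approximation
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return np.abs(data - u[:, :k] @ np.diag(s[:k]) @ vt[:k, :])

rng = np.random.default_rng(0)  # fixed seed, illustrative data only
n = 50
base = np.arange(n, dtype=float)
add = base + 20
loc = 42                        # hypothetical error position
add[loc] = 1000.0
noise_a = rng.permutation(n).astype(float)
noise_b = rng.permutation(n).astype(float)
data = np.column_stack((base, add, noise_a, noise_b))

total_norms = []
for k in range(1, data.shape[1] + 1):
    r = residual(data, k)
    at_error = r[loc, 1]                           # residual at the corrupted cell
    typical = np.median(np.delete(r[:, 1], loc))   # median residual of the other rows
    total_norms.append(np.linalg.norm(r))
    print(f'k={k}: at error cell={at_error:.4f}, median elsewhere={typical:.4f}')
```

Two things hold regardless of the data: the overall residual cannot grow as k increases, and once k equals the number of columns the matrix is reproduced exactly, so nothing is left to detect. Useful values of k therefore lie strictly below the number of columns.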

For now, let's set aside the choice of k and the noise columns, and focus on more complex linear combinations and nonlinear relationships.

In [116]:
# Again, create a base data with fairly large size
# This time, we use random values as base values
base_col = np.array(random.sample(range(50), 50))

# linear combination column
linear_col = 3 * base_col + 17
# Put an erroneous entry in linear_col
loc = random.randint(0, 49)  # randint is inclusive on both ends; 49 is the last valid index
linear_col[loc] = 1000
print('Error term is at %d' % loc)

# Bring it together
test_data4 = np.column_stack((base_col, linear_col))

result = show_difference(test_data4, 1)
print(result)
                    


Error term is at 1
            0         1
0   11.573962  1.246354
1   88.657912  9.547217
2    1.140495  0.122815
3   23.619237  2.543461
4   14.250690  1.534600
5   24.957601  2.687585
6   22.950055  2.471400
7   20.942509  2.255215
8   18.265781  1.966969
9   24.288419  2.615523
10   7.558870  0.813985
11  29.641874  3.192015
12  27.634329  2.975831
13   6.889688  0.741923
14  10.904780  1.174292
15  26.295965  2.831708
16   4.882143  0.525738
17   1.809677  0.194877
18   2.874597  0.309554
19  19.604145  2.111092
20   4.212961  0.453677
21  16.927418  1.822846
22   2.205415  0.237492
23  22.280873  2.399338
24   3.543779  0.381615
25  26.965147  2.903769
26  21.611691  2.327277
27  16.258236  1.750785
28   5.551325  0.597800
29  13.581508  1.462538
30  14.919872  1.606662
31  15.589054  1.678723
32  20.273327  2.183154
33   1.536233  0.165431
34  25.626783  2.759646
35  28.972692  3.119954
36  18.934963  2.039031
37   6.220507  0.669862
38   9.566416  1.030169
39  30.980238  3.3361

At the corrupted location, the residual is clearly larger than at the other rows, so this approach seems to work for linear relationships. Now, let's try some nonlinear relationships.
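
One detail worth checking is why the uncorrupted rows above still have sizeable residuals. A likely reason is that `3 * base_col + 17` is affine rather than strictly linear: the constant offset makes the two-column matrix rank 2, so a rank-1 approximation cannot fit it exactly. The sketch below illustrates this on clean data with no injected error. Column centering is my assumption here, not a step the notebook uses.

```python
import numpy as np

def rank_one_residual(data):
    # Absolute residual of the best rank-1 approximation
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return np.abs(data - s[0] * np.outer(u[:, 0], vt[0, :]))

rng = np.random.default_rng(0)  # fixed seed, illustrative data only
base = rng.permutation(50).astype(float)
linear = 3 * base + 17          # clean affine relationship, no error

raw = np.column_stack((base, linear))
centered = raw - raw.mean(axis=0)   # subtract each column's mean

raw_max = rank_one_residual(raw).max()
centered_max = rank_one_residual(centered).max()
print('Largest residual without centering:', raw_max)
print('Largest residual with centering:   ', centered_max)
```

After centering, the second column is exactly three times the first, so the rank-1 residual drops to rounding error. With an injected error the column mean is itself distorted, so whether centering helps detection in practice would need its own experiment.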

In [117]:
# Some columns
base_col = base_col + 1. # Shift by 1 so that np.log below never sees 0
quadratic_col = base_col ** 2
cubic_col = base_col ** 3
exp_col = np.exp(base_col)
log_col = np.log(base_col)
scale_col = (base_col - np.mean(base_col)) / np.std(base_col)
sqrt_col = np.sqrt(base_col)

# Let's check them one by one
# To limit the size of the output, we fix
# the error term at index 3

# Start with quadratic column
# Put an erroneous entry in quadratic_col
loc = 3

quadratic_col[loc] = -1
quadratic_test = np.column_stack((base_col, quadratic_col))
quadratic_result = show_difference(quadratic_test, 1)
print('Output for quadratic relationship:')
print(quadratic_result.head())


# Now Cubic column
cubic_col[loc] = -1
cubic_test = np.column_stack((base_col, cubic_col))
cubic_result = show_difference(cubic_test, 1)
print('Output for cubic relationship:')
print(cubic_result.head())


# Now exponential column
exp_col[loc] = -1  # Set the error term to -1
exp_test = np.column_stack((base_col, exp_col))
exp_result = show_difference(exp_test, 1)
print('Output for exponential relationship')
print(exp_result.head())

# Now log column
log_col[loc] = -1  # Set the error term to -1
log_test = np.column_stack((base_col, log_col))
log_result = show_difference(log_test, 1)
print('Output for log relationship')
print(log_result.head())

# Now scale column
scale_col[loc] = -1  # Set the error term to -1
scale_test = np.column_stack((base_col, scale_col))
scale_result = show_difference(scale_test, 1)
print('Output for scale relationship')
print(scale_result.head())

# Now sqrt column
sqrt_col[loc] = -1  # Set the error term to -1
sqrt_test = np.column_stack((base_col, sqrt_col))
sqrt_result = show_difference(sqrt_test, 1)
print('Output for sqrt relationship')
print(sqrt_result.head())
      

Output for quadratic relationship:
           0         1
0  10.090369  0.249478
1  10.068335  0.248934
2   1.899941  0.046975
3  39.000883  0.964274
4   9.541414  0.235906
Output for cubic relationship:
           0         1
0  15.944607  0.008704
1  15.255809  0.008328
2   1.995632  0.001089
3  39.000534  0.021290
4  16.470630  0.008991
Output for exponential relationship
      0              1
0  21.0  991144.696824
1  19.0  107343.872947
2   2.0    7246.848034
3  39.0   77379.528958
4  25.0  116366.821945
Output for log relationship
          0         1
0  0.096242  0.986595
1  0.105423  1.080714
2  0.048126  0.493353
3  0.464252  4.759143
4  0.075385  0.772785
Output for scale relationship
          0         1
0  0.009462  0.628055
1  0.011095  0.736492
2  0.024981  1.658207
3  0.023911  1.587177
4  0.006194  0.411181
Output for sqrt relationship
          0         1
0  0.184842  1.136232
1  0.200957  1.235293
2  0.172568  1.060781
3  1.163998  7.155149
4  0.147868  0.908953


From the outputs, we can see that, except for the exponential relationship (and, judging by the rows shown, the scale relationship), the relationships show a clear difference at the corrupted location. One thing to notice, however, is that in this experiment I set all erroneous terms to -1, whereas the data values are all positive (apart from the standardized scale column). We might expect the approach to detect such an obvious error, namely a negative entry among positive values. To validate the approach, I redo the experiment with positive error terms as follows:

In [119]:
# Some columns
base_col = base_col + 1. # base_col was already shifted in the previous cell, so this shifts it once more
quadratic_col = base_col ** 2
cubic_col = base_col ** 3
exp_col = np.exp(base_col)
log_col = np.log(base_col)
scale_col = (base_col - np.mean(base_col)) / np.std(base_col)
sqrt_col = np.sqrt(base_col)

# Let's check them one by one
# To limit the size of the output, we fix
# the error term at index 3

# Start with quadratic column
# Put an erroneous entry in quadratic_col
loc = 3

quadratic_col[loc] = 1000000
quadratic_test = np.column_stack((base_col, quadratic_col))
quadratic_result = show_difference(quadratic_test, 1)
print('Output for quadratic relationship:')
print(quadratic_result.head())


# Now Cubic column
cubic_col[loc] = 1000000
cubic_test = np.column_stack((base_col, cubic_col))
cubic_result = show_difference(cubic_test, 1)
print('Output for cubic relationship:')
print(cubic_result.head())


# Now exponential column
exp_col[loc] = 1000000  # Set the error term to 1000000
exp_test = np.column_stack((base_col, exp_col))
exp_result = show_difference(exp_test, 1)
print('Output for exponential relationship')
print(exp_result.head())

# Now log column
log_col[loc] = 1000000  # Set the error term to 1000000
log_test = np.column_stack((base_col, log_col))
log_result = show_difference(log_test, 1)
print('Output for log relationship')
print(log_result.head())

# Now scale column
scale_col[loc] = 1000000  # Set the error term to 1000000
scale_test = np.column_stack((base_col, scale_col))
scale_result = show_difference(scale_test, 1)
print('Output for scale relationship')
print(scale_result.head())

# Now sqrt column
sqrt_col[loc] = 1000000  # Set the error term to 1000000
sqrt_test = np.column_stack((base_col, sqrt_col))
sqrt_result = show_difference(sqrt_test, 1)
print('Output for sqrt relationship')
print(sqrt_result.head())

Output for quadratic relationship:
           0         1
0  21.979821  0.000916
1  19.983323  0.000833
2   2.999625  0.000125
3   1.691364  0.000071
4  25.971817  0.001083
Output for cubic relationship:
           0         1
0  20.967487  0.002033
1  19.224258  0.001864
2   2.997382  0.000291
3  56.967769  0.005524
4  24.295694  0.002356
Output for exponential relationship
      0             1
0  22.0  2.505025e+05
1  20.0  1.390902e+06
2   3.0  1.316613e+04
3  40.0  1.586534e+05
4  26.0  1.043555e+05
Output for log relationship
           0             1
0  21.999876  8.800922e-04
1  19.999880  8.000835e-04
2   2.999956  1.200115e-04
3   0.004418  1.764856e-07
4  25.999870  1.040110e-03
Output for scale relationship
           0             1
0  22.000012  8.800156e-04
1  20.000018  8.000144e-04
2   3.000065  1.200047e-04
3   0.000686  2.735760e-08
4  26.000001  1.040018e-03
Output for sqrt relationship
           0             1
0  21.999812  8.801543e-04
1  19.999821  8.001400e-0

Now, not surprisingly, the outputs from all of them look odd. In several cases (quadratic, log, scale), the residual at the corrupted row is actually smaller than at the other rows. We need to dig deeper to understand this behavior.
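
One possible explanation, offered as a hypothesis rather than a conclusion, is that a very large error dominates the leading singular direction. The rank-1 approximation then fits the corrupted row almost perfectly and pushes the residual onto the clean rows instead. The sketch below checks this on a made-up version of the log experiment: `base` is simply 1 to 50, and the error of 1000000 sits at index 3.

```python
import numpy as np

base = np.arange(1.0, 51.0)     # illustrative base column, all positive
log_col = np.log(base)
loc = 3
log_col[loc] = 1_000_000.0      # very large injected error
data = np.column_stack((base, log_col))

u, s, vt = np.linalg.svd(data, full_matrices=False)
approx = s[0] * np.outer(u[:, 0], vt[0, :])

# Euclidean norm of each row's residual under the rank-1 approximation
row_residual = np.linalg.norm(data - approx, axis=1)

# Cosine of the angle between the corrupted row and the top right singular vector
corrupted_direction = data[loc] / np.linalg.norm(data[loc])
alignment = abs(corrupted_direction @ vt[0])

print('Alignment of corrupted row with top singular vector:', alignment)
print('Row with the smallest residual:', int(np.argmin(row_residual)))
print('Residual at corrupted row:', row_residual[loc])
```

In this setup the top singular vector points almost exactly along the corrupted row, and that row ends up with the smallest residual. If the same mechanism is at work in the experiments above, a detector based on large residuals would miss errors that are large relative to the rest of the data.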