Looking at vote counts instead of first digits shows why this is not evidence of fraud #9

MechanicalTim · 2020-11-07T00:00:48Z

I suggest you plot histograms of the vote counts per precinct, alongside the histograms of the first digits. You will see immediately that these plots are not evidence of voter fraud, and that these data should not even be expected to obey Benford's Law.

Instead, what you will see is that counties like Allegheny were chosen, where Biden almost always got more than 200 votes per precinct and Trump did not. So it superficially looks like Biden has a "shortage" of the digit 1. In fact, vote counts that are roughly normally distributed, like these, should not be expected to obey Benford's Law, even approximately.

dogweather · 2020-11-07T01:38:02Z

I think that running the code on the second digit (which also must follow the law) can control for this problem?

dogweather · 2020-11-07T03:11:23Z

So I gave it a shot, examining the 2nd instead of the 1st digit, and it makes all the data look fraudulent. :-P This can't be right, though. I think I might have applied Benford's incorrectly:

https://www.kaggle.com/dogweather/allegheny-cty-benford-s

phillipnicol · 2020-11-07T04:39:43Z

I believe @MechanicalTim's point is correct. The problem here is that the precinct sizes themselves probably do not follow Benford's law, since they are probably all about the same size. Suppose every precinct in Allegheny has 500 voters, and Trump gets 20% and Biden gets 80%. Then Trump would get many leading 1's and Biden would get many leading 4's.
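The 500-voter example above can be tried as a quick simulation. This is only a sketch: the precinct size spread (`sd_size`), the number of precincts, and the function names are assumptions made for illustration, not anything taken from the repo or from real election data.

```python
import random
from collections import Counter


def first_digit(n: int) -> int:
    """Return the leading decimal digit of a positive integer."""
    return int(str(n)[0])


def simulate_precincts(n_precincts=1000, mean_size=500, sd_size=50,
                       share_a=0.8, seed=0):
    """Simulate two-candidate vote counts in similarly sized precincts."""
    rng = random.Random(seed)
    votes_a, votes_b = [], []
    for _ in range(n_precincts):
        # Hypothetical size model: sizes cluster around mean_size.
        size = max(50, round(rng.gauss(mean_size, sd_size)))
        # Each voter independently picks candidate A with probability share_a.
        a = sum(rng.random() < share_a for _ in range(size))
        votes_a.append(a)
        votes_b.append(size - a)
    return votes_a, votes_b


def digit_shares(counts):
    """Fraction of counts starting with each digit 1..9 (zeros are skipped)."""
    tally = Counter(first_digit(c) for c in counts if c > 0)
    total = sum(tally.values())
    return {d: tally[d] / total for d in range(1, 10)}


votes_a, votes_b = simulate_precincts()
shares_a = digit_shares(votes_a)
shares_b = digit_shares(votes_b)
for d in range(1, 10):
    print(d, f"A={shares_a[d]:.2f}", f"B={shares_b[d]:.2f}")
```

With no manipulation anywhere in this toy setup, both candidates' leading digits pile up on a handful of values instead of following the Benford curve, simply because the counts span less than one order of magnitude.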

tkstanczak · 2020-11-07T10:05:05Z

> So I gave it a shot, examining the 2nd instead of the 1st digit, and it makes all the data look fraudulent. :-P This can't be right, though. I think I might have applied Benford's incorrectly:
>
> https://www.kaggle.com/dogweather/allegheny-cty-benford-s

2nd digit should be more or less uniformly distributed so @dogweather - your analysis shows a wrong expected distribution line and the election numbers look fine:
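For reference, the expected second-digit frequencies under Benford's Law can be computed directly from the standard formula, which sums over all possible first digits. A minimal sketch (the function names are made up for this example):

```python
import math


def benford_first(d: int) -> float:
    """Benford probability that the first digit is d (d in 1..9)."""
    return math.log10(1 + 1 / d)


def benford_second(d: int) -> float:
    """Benford probability that the second digit is d (d in 0..9)."""
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))


first = [benford_first(d) for d in range(1, 10)]
second = [benford_second(d) for d in range(10)]

for d, p in enumerate(second):
    print(f"second digit {d}: {p:.4f}")
```

The second-digit probabilities fall gently from about 0.12 for digit 0 to about 0.085 for digit 9, so comparing second digits against the steep first-digit curve would make almost any data set look anomalous.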

ghost · 2020-11-08T12:45:41Z

> I think that running the code on the second digit (which also must follow the law) can control for this problem?

If the law doesn't apply to the first digit, then why should it apply to the second digit?

pkit · 2020-11-08T13:02:57Z

@MechanicalTim I think your graphs just confirm the anomaly.
The whole point of Benford's law is to detect where the numbers were artificially tailored for a specific percentage.
I.e., the question Benford's law answers is: if precinct sizes are naturally distributed and votes for a specific candidate are too, how do we detect the anomalies?

MechanicalTim · 2020-11-08T16:16:13Z

I'm not sure I've read every comment in every issue, but I have not seen a single person explain why the code/plots in this repo are evidence of fraud. The argument seems to be:

Here are first-digits distributions
Benford's Law!
Looks like voter fraud!

Voter fraud is an extraordinary claim, and therefore demands extraordinary evidence and a clear, cogent explanation. I have not seen an explanation of why the first digits should obey Benford's law. And @testes-t has some arguments in these issues about why they should not be expected to.

The explanation on my plots is simple. Biden got more votes here -- in a county where he was certainly expected to, and so has relatively few precincts that start with digit 1. (I have no doubt that one could find counties in say, Kentucky, where the pattern would be the same for Trump.) I fail to see how this means they were "artificially tailored".

The votes seem to follow a Poisson distribution, which explains the shapes of both Trump's and Biden's distributions, given that this is a county where Biden got a higher percentage.

Marcotte67 · 2020-11-08T17:16:30Z

I agree that using the first digit of votes is somewhat flawed, because the total number of voters per state, in millions, is most often about 2, and the two candidates each took about half of the votes in many states.

But, you can see that applying Benford's law just by the 51 states is very telling.

Trump had 23 (45%) leading 1's and Biden had 13 (25.5%) leading 1's. If we are expecting 30.1% leading 1's, then Trump is far more skewed (being 15% more than expected).

But, a sample size of >500 is needed to be accurate.

If we take the sum of both of their leading 1's (36 / 102) we get 35.3%.

Has anyone applied the law to Jorgensen? If you look at Johnson for 2016, the numbers are spot on.

pkit · 2020-11-08T17:19:28Z

> I have not seen an explanation of why the first digits should obey Benford's law.

Because they do obey Benford's law for all other candidates?

> And @testes-t has some arguments in these issues about why they should not be expected to.

They are not expected to. But if Benford works for every candidate except one, it's an anomaly.
It doesn't mean there is fraud, but it means there is an anomaly to investigate.

> The explanation on my plots

Irrelevant to Benford's law and the anomalies it detects.

> which explain the shapes of both Trump's and Biden's distributions

You base the distributions on the possibly fake frequency numbers; it's a clear tautology.

pkit · 2020-11-08T17:22:14Z

@Marcotte67

> Has anyone applied the law to Jorgensen?

https://github.com/cjph8914/2020_benfords

Marcotte67 · 2020-11-08T17:43:51Z

Since the number of votes itself is irrelevant, and only the distribution matters for demonstrating the possibility of fraud, would this work to remove the issue? (And again, I think a sample size of at least 100 is needed.)

Take the number of votes per precinct, divide by 29, and use the first digit of the remainder. That way the entire number is used to derive the distribution, and precinct or state size becomes a non-issue.

I picked 29 (random prime) to remove the possibility of remainder being 0.
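One way to sanity-check this idea before running it on real data is to look at what first digits the remainders can produce at all. The sketch below assumes, purely for illustration, that remainders are spread evenly over 0..28, which is plausible when the counts are much larger than 29 but is not verified here.

```python
from collections import Counter

MODULUS = 29


def first_digit(n: int) -> int:
    """Leading decimal digit of a non-negative integer (0 stays 0)."""
    return int(str(n)[0])


# Every possible remainder when dividing by 29, each assumed equally likely.
remainders = range(MODULUS)
tally = Counter(first_digit(r) for r in remainders)
shares = {d: tally[d] / MODULUS for d in range(10)}

for d in range(10):
    print(f"first digit {d}: {tally[d]:2d} of {MODULUS} remainders ({shares[d]:.3f})")

# A multiple of 29 still leaves remainder 0, even though 29 is prime.
print(58 % MODULUS)
```

Under that assumption the remainders 1 and 10 to 19 give eleven leading 1's, and 2 and 20 to 28 give ten leading 2's, so the baseline for this statistic is not the Benford curve, and a remainder of 0 remains possible for any multiple of 29.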

What are your thoughts?

I need to do all this math myself. I'll report back soon. But based on just comparing the sum of the leading 1s, 2s, and 3s at the state level (expected to be 60.2%), Trump is 8.3% off and Biden is 3.3% off. I think that is reason enough to dive in deeper.

I actually think Louisiana, Oklahoma, Alabama, New Jersey and Oregon are the states to be focusing on. Not the battleground states!

charlesmartin14 · 2020-11-08T17:56:18Z

The problem with making a claim of potential fraud here is that the deviations from Benford's Law, the way it is applied here, may simply reflect small sample sizes in areas where Biden received significantly more votes than Trump.

So the Benford's Law method can not be interpreted naively as a direct suspicion of fraud.

Still, seems unusual that the turnout for Biden (the percent) was so high in these unusually small districts/precincts/wards. And this is what the graphs do show. This is especially odd for Milwaukee, where overall turnout did not increase compared to 2016, and where we know that Biden lost votes in the Black community wards.

Indeed, in some places, it appears that turnout for Biden is negatively correlated with size, where it is positively correlated for Trump. This in itself is curious. Moreover, there is not yet a good baseline for comparison, such as the 2016 data.

IMHO, it is unclear whether the data that @MechanicalTim is showing is unusual or not. What the Benford's Law plots do effectively, however, is highlight this case. Benford's Law reflects something we know about natural data sets: they are almost always heavy-tailed (i.e., log-normal), not normally distributed. If a naive fraudster (i.e., not the Russians) were to use a random ballot-stuffing scheme, we would expect the voting data to be normally distributed. And that's the point. Benford's Law tells us where to begin looking for fraud.
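The statistical contrast between heavy-tailed and narrowly distributed counts can be illustrated with synthetic numbers. This is a sketch only: the log-normal and normal parameters below are arbitrary assumptions, and nothing here models real precincts or any actual fraud scheme.

```python
import math
import random
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n: int) -> int:
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def max_deviation(counts):
    """Largest absolute gap between observed first-digit shares and Benford."""
    tally = Counter(first_digit(c) for c in counts)
    total = len(counts)
    return max(abs(tally[d] / total - BENFORD[d]) for d in range(1, 10))


rng = random.Random(0)

# Heavy-tailed counts spanning several orders of magnitude (assumed parameters).
heavy_tailed = [c for c in (round(rng.lognormvariate(6, 1.5)) for _ in range(20000))
                if c >= 1]

# Counts packed into a narrow band around 400 (assumed parameters).
narrow = [c for c in (round(rng.gauss(400, 40)) for _ in range(20000)) if c >= 1]

dev_heavy = max_deviation(heavy_tailed)
dev_narrow = max_deviation(narrow)
print(f"log-normal counts: max deviation {dev_heavy:.3f}")
print(f"narrow normal counts: max deviation {dev_narrow:.3f}")
```

The wide log-normal sample tracks Benford closely while the narrow sample does not, which is the same effect whether the narrowness comes from manipulation or simply from similarly sized precincts.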

What we don't know from this is if this is typical of voting patterns from previous years and/or similar districts in other cities. To determine this, I think one needs both the same data from 2016, and something else, like census data, from both this and other cities.

There is more work to do.

Marcotte67 · 2020-11-08T19:16:39Z

I just took all of the vote history since 1976 at the state level (summing by year, state, candidate). Just to see if Benford's Law made sense at this level. And to find max and avg deviation to see if Trump's 15% deviation from expected 1's was plausible.

IT WAS NOT!

As far as comparing at the precinct level for 2016. I'll do this next if someone can just send me an .xls file with the raw data. I'll send you my file that I used for the above pic.

Geman_Marcotte@yahoo.com
Please email, don't link here.

charlesmartin14 · 2020-11-08T19:22:39Z

@Marcotte67 Thanks for looking this up. Sorry, I don't understand what you are saying. If you can share a notebook or some other writeup, or just clarify, that would be great. Thanks

Marcotte67 · 2020-11-08T19:42:09Z

Hi. From 1976 to 2016, I used the total votes by year, state, and candidate (a sample of 3740 rows).

About... 11 yrs x 51 states x 5.5 candidates per election

The average distribution and difference from Benford's was:
1's 32.4% (2.3% diff)
2's 19% (1.4% diff)
3's 11.6% (-0.9% diff)
4's 9.7% (-0.4% diff)

This proved to me that Benford's law can be applied to find anomalies.

This year, Trump had a leading 1 in 23 states. 23/51 is 45.1%, which is 15 percentage points above the expected 30.1%.

Biden had 13/51 leading 1's, which is 25.5%, or 4.6 percentage points below the expected 30.1%.

The highest difference in leading 1's since 1976 was in 2000 at 4.7%.

So, Trump's difference CAN NOT be ignored.

I need raw data for the states where he had a leading 1 to find where our efforts should be focused. And especially the states where he had a chance. So about 10 states.

Colorado, Arizona, Minnesota, Virginia, Louisiana, New Jersey - those are the main ones to start with.
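To put the 23/51 and 13/51 figures in context, one can ask how often counts that far from expectation would turn up by chance. The sketch below treats the 51 state totals as independent draws that each start with a 1 with the Benford probability; both of those are assumptions that other comments in this thread dispute, so read the result as a rough calibration only.

```python
import math

P_ONE = math.log10(2)  # Benford probability of a leading 1, about 0.301
N_STATES = 51


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


# Sampling noise in the share of leading 1's when only 51 values are observed.
std_error = math.sqrt(P_ONE * (1 - P_ONE) / N_STATES)
p_23_or_more = upper_tail(23, N_STATES, P_ONE)
p_13_or_fewer = lower_tail(13, N_STATES, P_ONE)

print(f"standard error of the share of 1's: {std_error:.3f}")
print(f"P(at least 23 leading 1's): {p_23_or_more:.4f}")
print(f"P(at most 13 leading 1's):  {p_13_or_fewer:.4f}")
```

The standard error with 51 values is about 6.4 percentage points, far larger than for the pooled yearly figures, which average over several hundred rows each, so the two kinds of deviation are not directly comparable.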

Marcotte67 · 2020-11-08T19:48:54Z

2nd digits have a zero so can't be used.

FYI - When I tried the 1st digit of the remainder Mod(1047), the results were not correct.

MechanicalTim · 2020-11-08T20:52:30Z

For the sake of consolidating all of these ideas, I am going to stop posting to this issue, in favor of following #11 and #17, which are more thoroughly developed, and seem more fruitful.

charlesmartin14 · 2020-11-09T04:14:28Z

@Marcotte67 If I understand what you are doing, I think that in order to test whether the data obeys Benford's Law, you need to compute the confidence levels expected for a small subsample like all of Trump's 2020 vote counts for each state. To do this, you should subsample (N=51) from the total distribution many times, measure the % distance from the expected Benford data each time, and then compute the average and maximum % distance. These can then be used as the average-case and worst-case confidence bounds.

I think you may be doing this by looking at the average and worst-case differences for a single candidate. It's just hard to tell without code.
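The subsampling procedure described above could look roughly like the following. Since the historical file is not available in this thread, the `historical` list is synthetic stand-in data (an assumption for illustration), and `subsample_deviations` is a hypothetical helper name, not code from anyone's notebook.

```python
import math
import random

BENFORD_ONE = math.log10(2)  # expected share of leading 1's, about 0.301


def first_digit(n: int) -> int:
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def subsample_deviations(counts, n=51, trials=2000, seed=0):
    """Average and maximum |share of leading 1's - Benford| over random subsamples."""
    rng = random.Random(seed)
    deviations = []
    for _ in range(trials):
        sample = rng.sample(counts, n)  # draw n rows without replacement
        share = sum(first_digit(c) == 1 for c in sample) / n
        deviations.append(abs(share - BENFORD_ONE))
    return sum(deviations) / trials, max(deviations)


# Stand-in for the 3740 historical (year, state, candidate) totals: synthetic
# log-normal counts, NOT real election data.
data_rng = random.Random(1)
historical = [c for c in (round(data_rng.lognormvariate(12, 1.5)) for _ in range(3740))
              if c >= 1]

avg_dev, max_dev = subsample_deviations(historical)
print(f"average deviation over subsamples: {avg_dev:.3f}")
print(f"maximum deviation over subsamples: {max_dev:.3f}")
```

Replacing `historical` with the real totals would give the average-case and worst-case bounds against which a single candidate's 51 state counts can be compared.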

adt-automation · 2020-11-09T13:26:08Z

@Marcotte67 One caution about your state example: you only have 51 data points (one from each state) in one time period, and no other groups to compare with within the same time period.

However, this project is looking at precinct counts and then aggregating at the county level on each digit. County-level aggregation is more relevant to applying Benford's law because we have over a thousand data points per county (plus the individual digits in the count). For example, here are the counts from 1,323 polling places in Allegheny County, which show voting counts for Biden that do not correlate with Benford's law (while the other candidates' counts do). At this aggregation level, you also have other counties during the same time period in non-swing states to compare with.
https://github.com/cjph8914/2020_benfords/blob/main/data/pa_allegheny_county.csv

cristi-neagu · 2020-11-09T17:40:10Z

I think it's important to note that the repo itself makes no claim that voter fraud occurred. It is just a first-digit analysis and it provides no interpretation. Now, considering the importance of the topic, I think it is useful to discuss the merits and shortcomings of the method as it pertains to voter fraud. But if these plots show no evidence of fraud, that doesn't mean there's anything wrong with them.

charlesmartin14 · 2020-11-09T17:54:18Z

Benford's Law rarely proves there is a fraud. Rather, it is used to detect potential fraud, which then requires a deeper inspection.

Seems to me that the Biden data shows anomalies that could be fraud (or could be something else), whereas the Trump data has no such anomalies.

That's exactly how Benford's Law is used to help detect other kinds of fraud. (Enron, in particular, is a famous case where the law was found to be useful, but, unfortunately, like this, after the fact)

And that's the whole point of this analysis.

ghost · 2020-11-09T17:56:48Z

Biden's first-digit data above, if that's what you are referring to, just shows a normal/chi squared distribution within the same order of magnitude. Other charts are more interesting.

charlesmartin14 · 2020-11-09T18:00:59Z

@estes-t See #17 and #31

markr-github · 2020-11-10T16:38:49Z

Posted to #17 but also relevant here: I'm confident we shouldn't expect these data to follow Benford's rule, it's just what happens when you get >40% of the vote in a lot of precincts of 500+ people. If you look at the Allegheny precincts that Trump won (upper right of figure below), his vote count also violates the rule, but I don't think that's evidence that those precincts were committing fraud for Trump.

Trump's load of "1" first digits comes from "blowout" counties where Biden won by a large margin and Trump only recorded 100--199 votes.

charlesmartin14 · 2020-11-10T17:40:37Z

@markr-github good stuff

dogweather · 2020-11-10T21:01:42Z

@markr-github Fascinating, thanks!

jimfcarroll mentioned this issue Nov 7, 2020

For numbers summed to 100%, there can be no such thing that only one member doesn't follow Benford's rule. #13

Open

ghost mentioned this issue Nov 8, 2020

Reach out to the voter integrity project #21

Open

charlesmartin14 mentioned this issue Nov 8, 2020

Milwaukee ward sizes are small and there is a highly preferred candidate #17

Open

ghost mentioned this issue Nov 11, 2020

The average ward in Milwaukee has 750 votes, how would Biden have 100-200 in 30% of wards? #36

Open
