Looking at vote counts instead of first digits shows why this is not evidence of fraud #9

Open
MechanicalTim opened this issue Nov 7, 2020 · 26 comments


@MechanicalTim

MechanicalTim commented Nov 7, 2020

I suggest you plot histograms of the vote counts per precinct, along with the histograms of the first digits. You will see immediately that these are not evidence of voter fraud, or even examples of data that should obey Benford's Law.

Instead, what you will see is that counties like Allegheny were chosen, where Biden almost always got more than 200 votes per precinct, and Trump did not. So, it superficially looks like Biden has a "shortage" of the digit 1. But, in fact, this normal distribution should not be expected to obey Benford's Law, even approximately.

[Image: Benford law 2020 Election Allegheny]

@dogweather

I think that running the code on the second digit (which also must follow the law) can control for this problem?

@dogweather

dogweather commented Nov 7, 2020

So I gave it a shot, examining the 2nd instead of the 1st digit, and it makes all the data look fraudulent. :-P This can't be right, though. I think I might have applied Benford's incorrectly:

https://www.kaggle.com/dogweather/allegheny-cty-benford-s

@phillipnicol

I believe @MechanicalTim's point is correct. The problem here is that the precinct sizes themselves probably do not follow Benford's law, since they are probably all about the same size. Suppose every precinct in Allegheny has 500 voters, and Trump gets 20% and Biden gets 80%: then Trump would get many leading 1's and Biden would get many leading 4's.
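A minimal sketch of the example above, with made-up numbers: the precinct sizes (500 to 700 voters) and the fixed 20% / 80% split are assumptions for illustration, not real Allegheny data. It shows how similar-sized precincts push each candidate's counts into a narrow band of leading digits.

```python
from collections import Counter


def first_digit(n: int) -> int:
    """Return the leading decimal digit of a positive integer."""
    return int(str(n)[0])


# Hypothetical precinct sizes clustered within one order of magnitude.
precinct_sizes = range(500, 701, 20)  # 500, 520, ..., 700

# Fixed 20% / 80% split, using integer arithmetic to avoid rounding surprises.
trump_votes = [size * 20 // 100 for size in precinct_sizes]
biden_votes = [size * 80 // 100 for size in precinct_sizes]

trump_digits = Counter(first_digit(v) for v in trump_votes)
biden_digits = Counter(first_digit(v) for v in biden_votes)

print("Trump leading digits:", dict(trump_digits))
print("Biden leading digits:", dict(biden_digits))
```

In this toy setup every Trump count falls between 100 and 140 and every Biden count between 400 and 560, so neither digit histogram can resemble Benford's curve, with no manipulation involved.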

@tkstanczak

tkstanczak commented Nov 7, 2020

> So I gave it a shot, examining the 2nd instead of the 1st digit, and it makes all the data look fraudulent. :-P This can't be right, though. I think I might have applied Benford's incorrectly:
>
> https://www.kaggle.com/dogweather/allegheny-cty-benford-s

The 2nd digit should be more or less uniformly distributed, so @dogweather, your analysis shows a wrong expected distribution line, and the election numbers look fine:
[image]
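For reference, the expected second-digit frequencies under Benford's law can be computed directly from the standard formula P(d) = Σ_{k=1..9} log10(1 + 1/(10k + d)). The sketch below is not taken from the repo or the Kaggle notebook; the function name is made up for illustration.

```python
import math


def benford_second_digit_probs() -> list[float]:
    """Expected frequency of each second digit 0-9 under Benford's law."""
    return [
        sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    ]


second_digit_probs = benford_second_digit_probs()
for digit, prob in enumerate(second_digit_probs):
    print(f"second digit {digit}: {prob:.4f}")
```

The values slope gently from about 12% for 0 down to about 8.5% for 9, which is why the second-digit curve looks close to flat compared with the first-digit curve.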

@ghost

ghost commented Nov 8, 2020

> I think that running the code on the second digit (which also must follow the law) can control for this problem?

If the law doesn't apply to the first digit, then why should it apply to the second digit?

@pkit

pkit commented Nov 8, 2020

@MechanicalTim I think your graphs just confirm the anomaly.
The whole point of Benford's law is to detect where the numbers were artificially tailored for a specific percentage.
I.e., the question Benford's law answers is: if precinct sizes are naturally distributed and votes for a specific candidate are too, how do we detect the anomalies?

@MechanicalTim
Author

I'm not sure I've read every comment in every issue, but I have not seen a single person explain why the code/plots in this repo are evidence of fraud. The argument seems to be:

  • Here are first-digits distributions
  • Benford's Law!
  • Looks like voter fraud!

Voter fraud is an extraordinary claim, and therefore demands extraordinary evidence and a clear, cogent explanation. I have not seen an explanation of why the first digits should obey Benford's law. And @testes-t has some arguments in these issues about why they should not be expected to.

The explanation on my plots is simple. Biden got more votes here -- in a county where he was certainly expected to -- and so has relatively few precincts whose counts start with the digit 1. (I have no doubt that one could find counties in, say, Kentucky, where the pattern would be the same for Trump.) I fail to see how this means they were "artificially tailored".

The votes seem to follow a Poisson distribution, which explains the shapes of both Trump's and Biden's distributions, given that this is a county where Biden got a higher percentage.

@Marcotte67

I agree that using the first digit of votes is somewhat flawed, due to the fact that the total number of voters per state / 1,000,000 is most often 2. And they each took about 1/2 of the votes in many states.

But, you can see that applying Benford's law just by the 51 states is very telling.

Trump had 23 (45%) leading 1's and Biden had 13 (25.5%) leading 1's. If we are expecting 30.1% leading 1's, then Trump is far more skewed (about 15 percentage points above expected).

But, a sample size of >500 is needed to be accurate.

If we take the sum of both of their leading 1's (36 / 102) we get 35.3%.

Has anyone applied the law to Jorgensen? If you look at Johnson for 2016, the numbers are spot on.
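The arithmetic in this comment can be checked with a short sketch. The counts (23 and 13 leading 1's out of 51) come from the comment itself; the helper name `share_and_gap` is made up for illustration.

```python
import math

# Benford's expected first-digit frequencies: P(d) = log10(1 + 1/d).
benford_first = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def share_and_gap(leading_ones: int, total: int) -> tuple[float, float]:
    """Return the observed share of leading 1's and its gap from Benford."""
    share = leading_ones / total
    return share, share - benford_first[1]


trump_share, trump_gap = share_and_gap(23, 51)
biden_share, biden_gap = share_and_gap(13, 51)
combined_share, combined_gap = share_and_gap(23 + 13, 51 + 51)

print(f"expected leading 1's: {benford_first[1]:.3f}")
print(f"Trump:    share {trump_share:.3f}, gap {trump_gap:+.3f}")
print(f"Biden:    share {biden_share:.3f}, gap {biden_gap:+.3f}")
print(f"Combined: share {combined_share:.3f}, gap {combined_gap:+.3f}")
```

The gaps are in percentage points of the digit share, and with only 51 observations per candidate they say nothing on their own about how large a gap chance alone would produce.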

@pkit

pkit commented Nov 8, 2020

> I have not seen an explanation of why the first digits should obey Benford's law.

Because they do obey Benford's law for all other candidates?

> And @testes-t has some arguments in these issues about why they should not be expected to.

They are not expected to. But if Benford works for every candidate except one, it's an anomaly.
It doesn't mean there is fraud, but it means there is an anomaly to investigate.

> The explanation on my plots

Irrelevant to Benford's law and the anomalies it detects.

> which explains the shapes of both Trump's and Biden's distributions

You base the distributions on the possibly fake frequency numbers, it's a clear tautology.

@pkit

pkit commented Nov 8, 2020

@Marcotte67

> Has anyone applied the law to Jorgensen?

https://github.com/cjph8914/2020_benfords

@Marcotte67

Since the number of votes is irrelevant, and only the distribution matters for demonstrating the possibility of fraud, would this work to remove the issue? (And again, I think a sample size of at least 100 is needed.)

If you take (# votes per precinct / 29) and use the first digit of the remainder, then it's actually using the entire number to derive the distribution. Plus, precinct or state size becomes a non-issue.

I picked 29 (random prime) to remove the possibility of remainder being 0.
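A rough sketch of the remainder idea, to make the proposal concrete. The function name is hypothetical, and the baseline assumes every remainder from 0 to 28 is equally likely, which is an assumption rather than something established in this thread.

```python
from collections import Counter

MODULUS = 29


def first_digit_of_remainder(votes: int, modulus: int = MODULUS) -> int:
    """Return the leading digit of votes % modulus (0 when the remainder is 0)."""
    return int(str(votes % modulus)[0])


# Baseline: leading digits if each remainder 0..28 occurs equally often.
baseline = Counter(first_digit_of_remainder(r) for r in range(MODULUS))

for digit in sorted(baseline):
    print(f"digit {digit}: {baseline[digit]} of {MODULUS} remainders")

# A count that is an exact multiple of 29 still leaves a remainder of 0.
print(first_digit_of_remainder(58))
```

Under that equal-likelihood assumption, digits 1 and 2 cover 21 of the 29 remainders, so the reference curve for this statistic would be this baseline rather than Benford's 30.1% / 17.6% / 12.5% pattern.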

What are your thoughts?

I need to do all this math myself. I'll report back soon. But, based on just comparing the sum of the leading 1's, 2's, and 3's at the state level (expected to be 60.2%), Trump is 8.3% off and Biden is 3.3% off. I think that is reason enough to dive in deeper.

I actually think Louisiana, Oklahoma, Alabama, New Jersey and Oregon are the states to be focusing on. Not the battleground states!

@charlesmartin14

charlesmartin14 commented Nov 8, 2020

The problem with making a claim of potential fraud here is that the deviations from Benford's Law, the way it is applied here, may simply reflect small sample sizes in areas where Biden has taken significantly more votes than Trump.

So the Benford's Law method cannot be interpreted naively as a direct suspicion of fraud.

Still, seems unusual that the turnout for Biden (the percent) was so high in these unusually small districts/precincts/wards. And this is what the graphs do show. This is especially odd for Milwaukee, where overall turnout did not increase compared to 2016, and where we know that Biden lost votes in the Black community wards.

Indeed, in some places, it appears that turnout for Biden is negatively correlated with size, where it is positively correlated for Trump. This in itself is curious. Moreover, there is not yet a good baseline for comparison, such as the 2016 data.

IMHO, it is unclear whether the data that @MechanicalTim is showing is unusual or not. What the Benford's Law plots do effectively, however, is highlight this case. Benford's Law reflects something we know about natural data sets -- they are almost always heavy-tailed (i.e., log-normal), not normally distributed. If a naive fraudster (i.e., not the Russians) were to use a random ballot-stuffing scheme, we would expect the voting data to be normally distributed. And that's the point. Benford's Law tells us where to begin looking for fraud.
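The heavy-tailed versus normal contrast can be illustrated with a small simulation. This is a sketch on synthetic numbers only: the distribution parameters (log-normal with mu=5, sigma=2; normal with mean 400, standard deviation 60) are arbitrary assumptions, not fitted to any election data.

```python
import math
import random
from collections import Counter


def first_digit(n: int) -> int:
    """Return the leading decimal digit of a positive integer."""
    return int(str(n)[0])


def digit_shares(values: list[int]) -> dict[int, float]:
    """Share of each leading digit 1-9 among the values that are >= 1."""
    counts = Counter(first_digit(v) for v in values if v >= 1)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Heavy-tailed counts spanning several orders of magnitude.
heavy_tailed = [int(rng.lognormvariate(5.0, 2.0)) for _ in range(50_000)]
# Narrow, roughly normal counts that stay within one order of magnitude.
narrow = [int(rng.gauss(400, 60)) for _ in range(50_000)]

heavy_shares = digit_shares(heavy_tailed)
narrow_shares = digit_shares(narrow)

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d}: Benford {expected:.3f}  "
          f"log-normal {heavy_shares[d]:.3f}  normal {narrow_shares[d]:.3f}")
```

The wide log-normal sample tracks Benford's curve closely, while the narrow normal sample piles up on the digits 3 and 4, whatever process produced it.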

What we don't know from this is if this is typical of voting patterns from previous years and/or similar districts in other cities. To determine this, I think one needs both the same data from 2016, and something else, like census data, from both this and other cities.

There is more work to do.

@Marcotte67

I just took all of the vote history since 1976 at the state level (summing by year, state, candidate), just to see if Benford's Law made sense at this level, and to find the max and average deviation to see if Trump's 15% deviation from the expected 1's was plausible.

IT WAS NOT!

[image]

As for comparing at the precinct level for 2016, I'll do this next if someone can just send me an .xls file with the raw data. I'll send you my file that I used for the above pic.

Geman_Marcotte@yahoo.com
Please email, don't link here.

@charlesmartin14

@Marcotte67 Thanks for looking this up. Sorry, I don't understand what you are saying. If you can share a notebook or some other writeup, or just clarify, that would be great. Thanks

@Marcotte67

Hi. I used the total votes from 1976-2016 (a sample of 3740 rows: year, state, candidate).

About... 11 yrs x 51 states x 5.5 candidates per election

The average distribution and difference from Benford's was:
1's 32.4% (2.3% diff)
2's 19% (1.4% diff)
3's 11.6% (-0.9% diff)
4's 9.7% (-0.4% diff)

This proved to me that Benford's law can be applied to find anomalies.

This year, Trump had a leading 1 in 23 states. 23/51 is 45.1%, which is 15 percentage points above the expected 30.1%.

Biden had 13/51 leading 1's, which is 25.5% and 4.6 percentage points below 30.1%.

The highest difference in leading 1's since 1976 was in 2000 at 4.7%.

So, Trump's difference CANNOT be ignored.

I need raw data for the states where he had a leading 1 to find where our efforts should be focused. And especially the states where he had a chance. So about 10 states.

Colorado, Arizona, Minnesota, Virginia, Louisiana, New Jersey - those are the main ones to start with.

@Marcotte67

2nd digits have a zero so can't be used.

FYI - When I tried the 1st digit of the remainder Mod(1047), the results were not correct.

@MechanicalTim
Author

For the sake of consolidating all of these ideas, I am going to stop posting to this issue, in favor of following #11 and #17, which are more thoroughly developed, and seem more fruitful.

@charlesmartin14

charlesmartin14 commented Nov 9, 2020

@Marcotte67 If I understand what you are doing, I think that in order to test whether the data obey Benford's Law, you need to compute the confidence levels expected for a small subsample, such as all of Trump's 2020 vote counts for each state. To do this, you should subsample (N=51) from the total distribution many times, measure the % distance from the expected Benford data for each subsample, and then compute the average and maximum % distance. These can then be used as the average-case and worst-case confidence bounds.

I think you may be doing this by looking at the average and worst-case differences for a single candidate. It's just hard to tell without code.
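A hedged sketch of the subsampling idea described above. Since the historical state-level table is not available in this thread, the sketch draws each subsample of 51 first digits from the ideal Benford distribution instead of from the real 1976-2016 data; that substitution, the trial count, and the function name are all assumptions.

```python
import math
import random

DIGITS = list(range(1, 10))
BENFORD = [math.log10(1 + 1 / d) for d in DIGITS]


def simulate_leading_one_counts(n_states: int = 51,
                                n_trials: int = 5000,
                                seed: int = 0) -> list[int]:
    """Count leading 1's in repeated samples of n_states Benford digits."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        sample = rng.choices(DIGITS, weights=BENFORD, k=n_states)
        counts.append(sample.count(1))
    return counts


N_STATES = 51
one_counts = simulate_leading_one_counts(n_states=N_STATES)

# Gap between each subsample's share of leading 1's and the Benford share.
gaps = [abs(c / N_STATES - BENFORD[0]) for c in one_counts]
mean_gap = sum(gaps) / len(gaps)
max_gap = max(gaps)

# How often a 51-point sample shows at least 23 leading 1's by chance.
tail_share = sum(c >= 23 for c in one_counts) / len(one_counts)

print(f"average gap: {mean_gap:.3f}")
print(f"largest gap: {max_gap:.3f}")
print(f"share of trials with >= 23 leading 1's: {tail_share:.4f}")
```

Replacing the `rng.choices` draw with a resample of first digits from the actual historical rows would turn this into the procedure suggested in the comment.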

@adt-automation

adt-automation commented Nov 9, 2020

@Marcotte67 One caution about your state example: you only have 51 data points (one from each state) in one time period, and no other groups to compare them with within the same time period.

However, this project is looking at precinct counts and then aggregating at the county level on each digit. County-level aggregation is more relevant to applying Benford's law because we have over a thousand data points per county (plus the individual digits in the count). For example, the counts from 1,323 polling places in Allegheny County show voting counts for Biden that do not conform to Benford's law (while the other candidates' counts do). At this aggregation level, you also have other counties during the same time period in non-swing states to compare with.
https://github.com/cjph8914/2020_benfords/blob/main/data/pa_allegheny_county.csv
[image]
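One common way to put a number on "does not conform" is a chi-square goodness-of-fit statistic against Benford's first-digit frequencies. This is a swapped-in technique for illustration, not something the repo is known to compute, and the two sample lists below are synthetic, not rows from the linked CSV.

```python
import math

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF8_P05 = 15.507


def benford_chi_square(vote_counts: list[int]) -> float:
    """Chi-square statistic of the first digits against Benford's law."""
    digits = [int(str(v)[0]) for v in vote_counts if v > 0]
    n = len(digits)
    stat = 0.0
    for d, p in BENFORD.items():
        observed = digits.count(d)
        expected = n * p
        stat += (observed - expected) ** 2 / expected
    return stat


# Synthetic sample whose first digits are close to Benford's proportions.
benford_like = ([1] * 301 + [2] * 176 + [3] * 125 + [4] * 97 + [5] * 79
                + [6] * 67 + [7] * 58 + [8] * 51 + [9] * 46)
# Synthetic sample where every count sits in the 400s.
clustered = [400 + i % 100 for i in range(1000)]

stat_benford_like = benford_chi_square(benford_like)
stat_clustered = benford_chi_square(clustered)

print(f"Benford-like sample: {stat_benford_like:.2f}")
print(f"Clustered sample:    {stat_clustered:.2f}")
print(f"5% critical value:   {CHI2_CRITICAL_DF8_P05}")
```

A large statistic only says the digits differ from Benford's curve. As the clustered sample shows, counts confined to one order of magnitude produce a huge value with no manipulation at all.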

@cristi-neagu

I think it's important to note that the repo itself makes no claim that voter fraud occurred. It is just a first-digit analysis and it provides no interpretation. Now, considering the importance of the topic, I think it is useful to discuss the merits and shortcomings of the method as it pertains to voter fraud. But if these plots show no evidence of fraud, that doesn't mean there's anything wrong with them.

@charlesmartin14

charlesmartin14 commented Nov 9, 2020

Benford's Law rarely proves there is a fraud. Rather, it is used to detect potential fraud, which then requires a deeper inspection.

Seems to me that the Biden data shows anomalies that could be fraud (or could be something else), whereas the Trump data has no such anomalies.

That's exactly how Benford's Law is used to help detect other kinds of fraud. (Enron, in particular, is a famous case where the law was found to be useful, but, unfortunately, like this, after the fact)

And that's the whole point of this analysis.

@ghost

ghost commented Nov 9, 2020

Biden's first-digit data above, if that's what you are referring to, just shows a normal/chi squared distribution within the same order of magnitude. Other charts are more interesting.

@charlesmartin14

@estes-t See #17 and #31

@markr-github

Posted to #17 but also relevant here: I'm confident we shouldn't expect these data to follow Benford's rule. It's just what happens when you get >40% of the vote in a lot of precincts of 500+ people. If you look at the Allegheny precincts that Trump won (upper right of figure below), his vote count also violates the rule, but I don't think that's evidence that those precincts were committing fraud for Trump.

Trump's load of "1" first digits comes from "blowout" precincts where Biden won by a large margin and Trump only recorded 100-199 votes.

[Image: Allegheny_2020]

@charlesmartin14

@markr-github good stuff

@dogweather

@markr-github Fascinating, thanks!
