Milwaukee ward sizes are small and there is a highly preferred candidate #17

Open
frycast opened this issue Nov 7, 2020 · 83 comments

@frycast

frycast commented Nov 7, 2020

The disappearance of Benford's law in Milwaukee is a function of voter preference alone. If one candidate has between 60% and 80% average chance of receiving a vote, then the sizes of the wards in Milwaukee are too small to accommodate Benford's law. See further details with my simulations here https://rpubs.com/frycast/687633

Edit: Not just too small, but too concentrated. They do not span many orders of magnitude.

Edit 2: The thread below gets sidetracked by an effort to investigate election data anomalies that are not directly related to this issue. My intention here is not to develop a fraud detection tool, but to highlight the major flaws in the one being used, which various news sources are currently touting as evidence of fraud. So far, this issue is still open, and should be resolved by at least adding some comments to the README clarifying that the pattern observed in Milwaukee can arise in election data in the absence of fraud. Hopefully the owner of this popular repository, and the people involved in this thread, are all interested in acting in good faith and will focus on resolving the issue.
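
To illustrate the point about orders of magnitude: Benford's law assigns the first digit d a probability of log10(1 + 1/d). The sketch below is not taken from the linked notebook. It compares that expectation with two synthetic samples; the log-uniform sampler and both ranges (10 to 1,000,000 and 300 to 700) are assumptions chosen only for illustration.

```python
import math
import random
from collections import Counter

def benford_expected():
    """Benford's first-digit probabilities: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_freq(values):
    """Observed share of each leading decimal digit 1..9."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def log_uniform_ints(low, high, n, rng):
    """Integers whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    lo, hi = math.log10(low), math.log10(high)
    return [int(10 ** rng.uniform(lo, hi)) for _ in range(n)]

rng = random.Random(0)
expected = benford_expected()
# Assumed ranges: five orders of magnitude versus well under one.
wide = first_digit_freq(log_uniform_ints(10, 1_000_000, 20_000, rng))
narrow = first_digit_freq(log_uniform_ints(300, 700, 20_000, rng))

print("digit  benford  wide   narrow")
for d in range(1, 10):
    print(f"{d:5d}  {expected[d]:.3f}    {wide[d]:.3f}  {narrow[d]:.3f}")
```

Under these assumptions, the wide sample should track Benford's law closely, while the narrow sample can only ever produce the digits 3 to 6, whatever process generated it.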

@chavenor

chavenor commented Nov 7, 2020

@tcauth

tcauth commented Nov 7, 2020

The disappearance of Benford's law in Milwaukee is a function of voter preference alone. If one candidate has between 60% and 80% average chance of receiving a vote, then the sizes of the wards in Milwaukee are too small to accommodate Benford's law. See further details with my simulations here https://rpubs.com/frycast/687633

but according to your simulation:
Trump's 39% chance should also show some observable difference from Benford's law, however ...

@dshield55

@frycast thanks for your great work. I'd like to second @tcauth , wondering if there's a reason for that

@chavenor

chavenor commented Nov 7, 2020

@tcauth would the smaller portion of votes possibly not move the needle due to the fact that they are blending in with the rest of the wards of equal size?

@chavenor

chavenor commented Nov 7, 2020

@tcauth what does the 2016 election look like -- are we able to get the data to run it?

@charlesmartin14

charlesmartin14 commented Nov 8, 2020

If Biden is 61%, then say Trump is 39% (ignoring the other candidates). We can take a screenshot to see Benford's law for p=0.30 and see that it is violated in this case as well.

[Image: Screen Shot 2020-11-07 at 6 15 09 PM]

@charlesmartin14

And this is the point of one of the main research papers, which argues against using Benford's law in these situations.

https://www.cambridge.org/core/journals/political-analysis/article/benfords-law-and-the-detection-of-election-fraud/3B1D64E822371C461AF3C61CE91AAF6D?fbclid=IwAR2x18HnGzK7rQDrDhid25i-gUMo31xo6tyJT7UMK97YJWva7vbPApHWKSg

[Image]

I think the issue is to have the proper baselines, which may be determined by looking at the 2016 data.

@frycast
Author

frycast commented Nov 8, 2020

but according to your simulation:
Trump's 39% chance should also show some observable difference from Benford's law, however ...

Trump's average vote probability was roughly 26% in Milwaukee for this election, and Biden's was roughly 73%. With those numbers, Trump gets a nice Benford's law in my simulation, and Biden gets the observed spike around the digit 4. I've updated the notebook at the same link so you can see Trump's distributions too: https://rpubs.com/frycast/687633

It makes sense: the populations in Milwaukee Wards are often around 500-1000, so 73% of a random Milwaukee ward population is a number that starts with a digit somewhere between 3 and 7.
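
A minimal sketch of that reasoning, not the code from the linked notebook. It assumes ward sizes drawn uniformly from 500 to 1000 (a simplification of "often around 500-1000"), 300 wards (an arbitrary number), and every voter choosing independently with the probabilities quoted above (0.73 Biden, 0.26 Trump, the remainder other).

```python
import random
from collections import Counter

def simulate_ward(n_voters, p_biden, p_trump, rng):
    """Draw one independent choice per voter; return (Biden votes, Trump votes)."""
    biden = trump = 0
    for _ in range(n_voters):
        u = rng.random()
        if u < p_biden:
            biden += 1
        elif u < p_biden + p_trump:
            trump += 1
    return biden, trump

def first_digit_share(counts):
    """Share of wards whose vote count starts with each digit 1..9."""
    digits = Counter(int(str(c)[0]) for c in counts if c > 0)
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(1, 10)}

rng = random.Random(2020)
N_WARDS = 300  # assumed number of wards, for illustration only
results = [simulate_ward(rng.randint(500, 1000), 0.73, 0.26, rng)
           for _ in range(N_WARDS)]

biden_share = first_digit_share([b for b, _ in results])
trump_share = first_digit_share([t for _, t in results])

print("digit  biden  trump")
for d in range(1, 10):
    print(f"{d:5d}  {biden_share[d]:.3f}  {trump_share[d]:.3f}")
```

With no fraud anywhere in this setup, Biden's counts land almost entirely on the digits 3 to 7. Trump's counts land on 1 and 2 here only because the assumed size range is so narrow; real ward sizes vary more widely, which is what spreads his counts across more digits.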

@tcauth

tcauth commented Nov 8, 2020

Makes sense. Can we assume that if the samples are large enough, at a state -> county level, the distribution should be better?
The NYTimes publishes a very good HTML source for that kind of data, so pd.read_html() can read it directly. I got an MI chart from this:
https://www.nytimes.com/interactive/2020/11/03/us/elections/results-michigan.html

[Image]
https://github.com/tcauth/benford2020usvote/blob/main/benford.ipynb

@tcauth

tcauth commented Nov 8, 2020

but according to your simulation:
Trump's 39% chance should also show some observable difference from Benford's law, however ...

Trump's average vote probability was roughly 26% in Milwaukee for this election, and Biden's was roughly 73%. With those numbers, Trump gets a nice Benford's law in my simulation, and Biden gets the observed spike around the digit 4. I've updated the notebook at the same link so you can see Trump's distributions too: https://rpubs.com/frycast/687633

It makes sense: the populations in Milwaukee Wards are often around 500-1000, so 73% of a random Milwaukee ward population is a number that starts with a digit somewhere between 3 and 7.
And I suppose the 1 should be the most important one, right?

@1e100

1e100 commented Nov 8, 2020

Care to explain this? The lower the turnout, the higher the Biden advantage.
[Image: milwaukee_wards]

@1e100

1e100 commented Nov 8, 2020

Here's Allegheny:
[Image: allegheny]

@tcauth

tcauth commented Nov 8, 2020

Care to explain this? The lower the turnout, the higher the Biden advantage.
[Image: milwaukee_wards]

What are the x-axis and y-axis? Turnout and votes?

@1e100

1e100 commented Nov 8, 2020

Y is turnout, X is percentage of the votes for a given candidate in each precinct. Color is the size of the bucket. It's a 2D histogram.

@tcauth

tcauth commented Nov 8, 2020

Y is turnout, X is percentage of the votes for a given candidate in each precinct. Color is the size of the bucket. It's a 2D histogram.

And left is Biden right is Trump?

@1e100

1e100 commented Nov 8, 2020

Yes. Earlier comment I made had them inadvertently swapped on the Allegheny histogram. I've deleted that comment and uploaded the correct one. I could push my fork if someone wants to play with this. In particular, it'd be interesting to see if this is a normally occurring phenomenon by comparing these with a county where Trump won. I started looking into Miami Dade, but precinct-level voter turnout data is not easily available for it, unfortunately.

@1e100

1e100 commented Nov 8, 2020

To be clear (for folks who will repost this on social media) - I'm not alleging any fraud or anything like that. Just surfacing a pattern which I thought was counterintuitive, and seeking an explanation.

@tcauth

tcauth commented Nov 8, 2020

Yes. Earlier comment I made had them inadvertently swapped on the Allegheny histogram. I've deleted that comment and uploaded the correct one. I could push my fork if someone wants to play with this. In particular, it'd be interesting to see if this is a normally occurring phenomenon by comparing these with a county where Trump won. I started looking into Miami Dade, but precinct-level voter turnout data is not easily available for it, unfortunately.

It seems to be very easy to get and process state-level data, which should be statistically very solid.

@1e100

1e100 commented Nov 8, 2020

Well, precinct level turnout data is technically available, but "available" in this case means a gigantic PDF, from which hundreds of values need to be copied out by hand. I don't have that kind of time or motivation.

@frycast
Author

frycast commented Nov 8, 2020

Care to explain this? The lower the turnout, the higher the Biden advantage.

That's interesting. I haven't played around with that data yet. Do those turnouts count mail-in ballots? Is there also a negative correlation between turnout and something else that is correlated with Biden support, such as population density?

@1e100

1e100 commented Nov 8, 2020

Yes:

  1. Turnout = all votes / registered voters
  2. Vote share = votes for candidate / all votes
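
The two definitions above, together with the 2D histogram described earlier, can be sketched as follows. This is not the code from the fork mentioned in this thread; the function names, the 10-bin grid, and the precinct tuples are made up for illustration.

```python
from collections import Counter

def turnout(total_votes, registered):
    """Turnout = all votes / registered voters."""
    return total_votes / registered

def vote_share(candidate_votes, total_votes):
    """Vote share = votes for candidate / all votes."""
    return candidate_votes / total_votes

def bin_index(value, n_bins):
    """Map a fraction to a bin 0..n_bins-1; values of 1.0 or more go in the last bin."""
    return min(int(value * n_bins), n_bins - 1)

def histogram_2d(precincts, n_bins=10):
    """Count precincts per (vote-share bin, turnout bin) cell."""
    cells = Counter()
    for registered, total_votes, candidate_votes in precincts:
        if registered == 0 or total_votes == 0:
            continue  # skip empty precincts to avoid dividing by zero
        x = bin_index(vote_share(candidate_votes, total_votes), n_bins)
        y = bin_index(turnout(total_votes, registered), n_bins)
        cells[(x, y)] += 1
    return cells

# Made-up precincts: (registered voters, all votes, votes for the candidate)
precincts = [
    (1000, 650, 490),
    (2000, 1290, 980),
    (800, 680, 260),
    (0, 0, 0),
]
cells = histogram_2d(precincts)
for (x, y), count in sorted(cells.items()):
    print(f"share bin {x}, turnout bin {y}: {count} precinct(s)")
```

The cell counts are what the color scale in such a plot would encode.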

@1e100

1e100 commented Nov 8, 2020

I haven't studied other confounders, perhaps someone else will.

@frycast
Author

frycast commented Nov 8, 2020

can we assume that if the samples are large enough to a state -> county level, the distribution should be better?

There's a pretty informative thread evolving on this in general here https://skeptics.stackexchange.com/questions/49782/do-vote-counts-for-joe-biden-in-the-2020-election-violate-benfords-law

@pkit

pkit commented Nov 8, 2020

@frycast it has nothing to do with what is presented here. They discuss other graphs of unknown source.

@charlesmartin14

charlesmartin14 commented Nov 8, 2020

Trump's average vote probability was roughly 26% in Milwaukee for this election, and Biden's was roughly 73%.

That's very helpful, thanks

I assume you are estimating the true probabilities using the observed frequencies. But what if these are wrong, and significantly so ?

I think the big question is: is it possible to detect this kind of fraud, with any level of confidence, using advanced techniques, and then correlate it with real-world behavior?

A good starting point, it seems, would be to look at earlier data from 2016, and try to use this as a baseline for the analysis and to estimate Biden's and Trump's true 'expected probability', with some error bars, from that frequency data.

In Milwaukee, however, we see that turnout was not significantly different from 2016, and that the expected probabilities for Biden would, in fact, be lower than 2016 in majority-Black wards.

https://www.jsonline.com/story/news/politics/elections/2020/11/07/election-results-milwaukee-turnout-flat-despite-wisconsin-surge/6188097002/

Is it possible to get the 2016 data ?

@pkit

pkit commented Nov 8, 2020

@charlesmartin14

I think the big question is: is it possible to detect this kind of fraud, with any level of confidence, using advanced techniques, and then correlate it with real-world behavior?

You cannot detect fraud using statistics alone. But you can detect statistical anomalies.
Benford's law helps in detecting these. Seeing it trigger for one candidate is an anomaly.

@charlesmartin14

charlesmartin14 commented Nov 8, 2020

You cannot detect fraud using statistics alone. But you can detect statistical anomalies.
Benford's law helps in detecting these. Seeing it trigger for one candidate is an anomaly.

Thanks for clarifying. By 'detect fraud' I mean to detect anomalies suggestive of fraud.

(I deleted this last part by accident when editing)

Is it possible to get the 2016 data ?

@dshield55

@frycast, you want to run that in base 16 for giggles?

@frycast
Author

frycast commented Nov 8, 2020

I assume you are estimating the true probabilities using the observed frequencies. But what if these are wrong, and significantly so ?

"Empirical probability" means the probability on the observed data alone. There is no need for any estimate of the true probability here. We only need the empirical probability.

This is because my simulation demonstrates that the disagreement with Benford's law arises even if the true probability is equal to the empirical probability. So, in other words, if there is no fraud, then you still get the observed disagreement in Milwaukee.

My simulation cautions against an erroneous use of Benford's law to try to detect election fraud in Milwaukee, since the observed result is exactly what we would expect if there was no fraud.

@1e100

1e100 commented Nov 9, 2020

I've pushed the code to https://github.com/1e100/2020_benfords. Disclaimer, once again: I do not claim there is any fraud here. I'd like to see an explanation for Biden's "the lower the turnout, the higher the vote" pattern.

The entire story of elections in America is that democrats outnumber republicans, but republicans show up to vote. There are multiple theories around this ranging from just being less enthusiastic about their candidates to voter suppression, but that's out of the scope of this.

For some evidence, as of this election there are 4.2 mil dems vs 3.5 mil republicans in PA:
https://docs.google.com/spreadsheets/d/1LEkTZN_1Ee5AVkxqgVdh1OWadz85qxqxU8HHF8BvcCY/edit?usp=sharing

However with most of the vote in, there are only 3.35 million votes for Biden, and 3.31 million votes for Trump. So the expectation is that turnout among dems would be proportionally lower. These graphs are consistent with that -- districts where Biden has a higher percentage of the vote tend to have a higher percentage of democrats who tend to drag down the turnout percentage. There are just more people in those districts, so that lower percentage equates to a higher raw total of votes.
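
As a rough check of that arithmetic, using only the Pennsylvania figures quoted above: the sketch treats each candidate's votes as if they were cast only by that party's registrants. That is an assumption made purely for illustration, since it ignores unaffiliated and crossover voters.

```python
# Figures quoted above for Pennsylvania (registrations and votes counted so far).
registered = {"dem": 4.2e6, "rep": 3.5e6}
votes = {"dem": 3.35e6, "rep": 3.31e6}

# Votes per registered party member: a crude stand-in for party turnout.
implied = {party: votes[party] / registered[party] for party in registered}

for party, ratio in implied.items():
    print(f"{party}: {ratio:.1%}")
# dem: 79.8%
# rep: 94.6%
```

On these numbers the Democratic ratio is well below the Republican one, which is the direction the argument above relies on.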

Here are the results from Allegheny in 2016 showing a similar correlation: https://imgur.com/a/nmdtoh4

Here's a google sheet for Allegheny from 2016 if you want to play with the data yourself: https://docs.google.com/spreadsheets/d/1r9fVxYwIKQkUz8SYHCQmxWSG5bS8GE6fOnN_YtOvAwk/edit?usp=sharing

Could be. According to this data GOP voter turnout in Allegheny slightly exceeds 100% (so there are probably some unaffiliated in it), whereas Joe's voter turnout is about 74%. I'm not sure what "count of all other voters" means in the spreadsheet, though.

@charlesmartin14

charlesmartin14 commented Nov 9, 2020

On the voter participation data...

I think the question that is relevant to this thread is,

If the Biden vote counts are distributed normally in the range 300-400 (and are therefore non-Benford), whereas the Trump counts are Benford-like, are the overall voter participation trends consistent with the vote distributions seen?

@frycast
Author

frycast commented Nov 9, 2020

So you may be using a methodology that generates non-Benford data in all cases, and claiming it is evidence that the distributions are non-Benford on certain subcases.

That is a good criticism.

I think this isn't a problem here though. My argument for this is that the simulated vote count distributions do look visually very similar to the observed ones, for both Biden and Trump (and not just the Benford distributions).

A visual comparison would not be sufficient in many cases, but in this case, especially since no inference is being made about the true distribution, we can see that, even if the underlying data generating process is being misrepresented, there is enough agreement to justify a visual comparison of the Benfords.

So a clearer conclusion is: if the data are generated binomially, with no difference in the data generating process (DGP) between Biden and Trump other than the probability of receiving a vote, then the simulated and observed data visually agree in both count and Benford distributions, for both Biden and Trump.

@charlesmartin14

@frycast That's fine

I think the hump at 3 for the Biden first digit data can also be inferred just by looking at the distribution of vote counts. And that's the object to look at: See #31

@markr-github

frycast, you seem right.

With most precincts having >500 voters, winners' counts will generally start with 2 or more, so anyone who wins many counties won't look like Benford's Law. If you want to apply Benford, then I'd apply Hitchens and say you need to demonstrate that the numbers should follow it: "That which can be asserted without evidence, can be dismissed without evidence."

Here are the Allegheny PA vote counts split by precinct winner.
[Image: Allegheny_candidate_counts_by_winner]

The Trump counts in Trump precincts don't follow Benford but that's not evidence that Trump counties were committing fraud to give him the election.

Similarly, take the UK elections in 2019, where the Conservatives won most seats. Below are counts by constituency for the four largest parties. The Conservative non-Benford-ness is not fraud; it's just the result when you win lots of ~50k-vote constituencies and typically run the vote close when you lose.

[Image: UK_election_benfords]

@charlesmartin14

charlesmartin14 commented Nov 10, 2020

@markr-github That's very interesting. Let me suggest plotting the vote count distributions themselves.

Benford's Law data is heavy-tailed, but heavy-tailed data may not be Benford.

We can see if the data is heavy-tailed or not by looking at plots of the vote count distributions

(and by checking the tail statistics; you can use the powerlaw packages in R or python to do this)

*Also, can you share the data sets and notebooks, if they are checked in?

@markr-github

@charlesmartin14
Didn't see anything in the vote distributions, total votes per precinct, etc. that says Benford's should apply to these numbers.

Not in a notebook, but the code and data are here:
https://github.com/markr-github/benford-election

@charlesmartin14

charlesmartin14 commented Nov 10, 2020

@markr-github. Thanks I'll take a look after work today.

I think what we have learned so far is that when we see deviations from Benford's Law, the data is clustered around a peak (say 100-200 votes). I'm just a bit surprised that in the cases I have looked at so far (Biden's Election Day data for Allegheny) the data appears nearly perfectly Gaussian and not seemingly heavy-tailed (i.e., Biden's Absentee Data, Trump's data, etc.). That is, it appears that there are (unusually?) very few districts with really high turnout for Biden. See #31

But maybe there is just not enough data to see the tail? That could certainly be, and it may be necessary to study total Biden districts across say an entire state ? I'm still checking that and need to do more careful tests.

This also, however, appears to be how the Biden vote distributions behave. That's what is in the charts that @andrewzigerelli is showing above, if I understand this correctly. The higher the turnout in a district, the lower the Biden percentage. And the exact opposite for Trump.

@markr-github

@charlesmartin14
Seems like a different topic to the Benford issues?

That relationship between turnout and margin is exactly what I would have guessed beforehand, so it doesn't surprise me.

In every US presidential election with data, the highest-turnout ethnic group has been "white non-hispanic":
https://www.statista.com/statistics/1096113/voter-turnout-presidential-elections-by-ethnicity-historical/
And turnout is higher for older versus younger voters:
https://www.politifact.com/article/2020/mar/04/closer-look-turnout-young-voters-and-key-bernie-sa/
I would expect that groups that are older and more non-hispanic white will (i) have higher turnout and (ii) have a more pro-Trump margin.

If you were convinced that fraud was happening then the naive approach would be to look at high-turnout areas, since more ballots increases the probability of "fake" ballots being included. I don't think there's any evidence that Trump precincts were fabricating votes though.

@charlesmartin14

charlesmartin14 commented Nov 10, 2020

@markr-github

Seems like a different topic to the Benford issues?

I asked the question because I see Benford's Law as a statistical test for heavy-tailed behavior, characteristic of natural (i.e. not fake) data. I agree, I don't think it can be interpreted using a naive approach. However, there are other tests for heavy-tailed behavior, more suitable to finite-size systems, that might prove more useful here.

The simplest of these is to fit the tail of the data to a truncated power law distribution, and then compare this to an exponential distribution using a non-parametric Kolmogorov–Smirnov test; see #31.
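
A simplified, standard-library sketch of that kind of comparison. Note the substitutions: it fits a plain (untruncated) power law rather than a truncated one, and it compares the two fits by their Kolmogorov–Smirnov distance to the empirical CDF instead of running a formal test. It does not use the powerlaw package; the function names, the cutoff x_min = 100, and the synthetic samples are all assumptions for illustration.

```python
import math
import random

def ks_distance(sorted_x, cdf):
    """Largest gap between the empirical CDF of sorted_x and a model CDF."""
    n = len(sorted_x)
    d = 0.0
    for i, x in enumerate(sorted_x):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

def compare_tail(values, x_min):
    """Fit power-law and exponential tails above x_min by maximum likelihood."""
    tail = sorted(v for v in values if v >= x_min)
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    excess = sum(x - x_min for x in tail)
    if n == 0 or log_sum == 0 or excess == 0:
        raise ValueError("need tail values strictly above x_min")
    alpha = 1 + n / log_sum   # power law: p(x) proportional to x**(-alpha)
    rate = n / excess         # exponential: p(x) proportional to exp(-rate * x)
    ks_power = ks_distance(tail, lambda x: 1 - (x / x_min) ** (1 - alpha))
    ks_expo = ks_distance(tail, lambda x: 1 - math.exp(-rate * (x - x_min)))
    return {"alpha": alpha, "rate": rate, "ks_power": ks_power, "ks_expo": ks_expo}

# Synthetic samples only (not election data), to check the sketch behaves sensibly.
rng = random.Random(31)
X_MIN = 100.0
heavy = [X_MIN * rng.paretovariate(1.5) for _ in range(2000)]    # alpha = 2.5
light = [X_MIN + rng.expovariate(1 / 50) for _ in range(2000)]   # mean excess 50

heavy_fit = compare_tail(heavy, X_MIN)
light_fit = compare_tail(light, X_MIN)
print("heavy-tailed sample:", heavy_fit)
print("light-tailed sample:", light_fit)
```

The smaller KS distance indicates which tail model describes the sample better. With real vote counts, the choice of x_min matters a great deal and would need its own justification.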

more ballots increases the probability of "fake" ballots

But it also increases the probability of "real" ballots, so it says nothing about the signal-to-noise ratio, which will certainly affect any estimator we use.

@charlesmartin14

charlesmartin14 commented Nov 10, 2020

@frycast

And notice...Taleb also used normal random data as an example of Benford

https://twitter.com/nntaleb/status/1326212740273278978

This seems qualitatively correct.

I don't think it's helpful, however, to chime in. I prefer to avoid a flame war on Twitter.

There are lots of smart people here and I think we should just figure this out ourselves. Maybe there is something here, maybe not. I'm hoping to see more once we dig into the vote distributions.

@chavenor

This just hit the web. Do we have a way to check this or comment on it? Do we need to open another issue?
https://www.pscp.tv/w/1BdGYYjgkgQGX

@MechanicalTim

I would encourage anyone planning on watching that video to read Dr. Shiva Ayyadurai's Wikipedia page as well. Here are the first few sentences, for your convenience:

V. A. Shiva Ayyadurai (born Vellayappa Ayyadurai Shiva,[2] December 2, 1963)[3] is an Indian-American scientist, engineer, politician, entrepreneur, and promoter of conspiracy theories and unfounded medical claims. He is notable for his widely discredited claim to be the "inventor of email".

@chavenor

@MechanicalTim agreed. Can the data be grabbed and can we run this on our own to either confirm or deny the outcome?

@charlesmartin14

@chavenor We should try to get the data ourselves. I would also suggest reaching out to the researcher at MIT.

@chavenor

chavenor commented Nov 10, 2020

@charlesmartin14 I'm way ahead of you. Already asked on Twitter. Who was the guy from MIT? Did they have their info in that presentation?
[Image: ok-fine]

@chavenor

I found the other guys and reached out on LinkedIn. Hope they can share their data with us so we can double-check it.

@alexsullivan114

alexsullivan114 commented Nov 10, 2020

@chavenor For reference, Dr. Shiva Ayyadurai ran for the senate as a Republican in Massachusetts. He's considered a bit of a joke over here.

@charlesmartin14

@alexsullivan114 It doesn't matter. What matters is getting data and doing our own honest analysis.

@chavenor

chavenor commented Nov 10, 2020

@alexsullivan114 there seems to be a trend that anyone that is anti-establishment gets the "crazy stamp" -- I've moved beyond that prism.

They made claims. I've asked for the data. If we get it and can verify the results then that is all the proof we should need.

I didn't see that @charlesmartin14 had already responded. Tossed ya a thumbs-up; happy to have your input.

@alexsullivan114

Sure - totally fair. I was just trying to add some context about who this person was - of course the data should stand on its own.

@charlesmartin14

charlesmartin14 commented Nov 10, 2020

@alexsullivan114

There are 3 people presenting, one of whom is a state election commissioner.
https://www.shelbyvote.com/team/bennie-smith

Remember also that there are claims that some media companies like Twitter, CNN, etc. are actively censoring information claiming to be (potential) evidence of fraud. So he may be forced to go 'underground' , so to speak.

The data should speak for itself

@MechanicalTim

It seems to me that the Shiva stuff is a case of deliberately deceptive plotting.

They display plots using the following data:

  • fraction who voted "straight Republican" (but guessing this means for non-Prez races?)
  • fraction who voted for Trump

If we posit that people who are more likely to vote straight Republican are more likely to vote for Trump, then the mean percentages voting for Republican, and voting for Trump, might look something like this:

repub_prec_fraction = [20; 30; 40; 50; 60; 70; 80]; % and rough approx of "straight Rep"
trump_likelihood = [25; 30; 35; 40; 45; 50; 55];

THE ABOVE ARE NOT REAL DATA! USED FOR ILLUSTRATIVE PURPOSES ONLY!

(Also, excuse the MATLAB syntax.)

Here are two subplots:

  • Top: Plot that relationship straightforwardly
  • Bottom: Plot it using the contrived variable from the video

[Image: Shiva nonsense]

(In Shiva's plot, there is of course the random scatter of real data around those lines.)

He then claims that this shape is somehow evidence of Biden stealing votes from Trump.

I have admittedly over-simplified a bit, for the sake of making my fundamental point more directly. But I think this is at the heart of Shiva's plot. I think he is obscuring truth, not revealing it.

Shiva does other deceptive things on the plot, like adding lines to "guide the eye", which, if you ignore them, you realize do not actually follow the data. There are also edge effects on the plot, that he ignores. Finally, he also makes verbal statements that are similarly deceiving.

I rate the video 1 out of 10. Would not watch again. (Disclaimer: I only watched the first 37 minutes before writing this.)
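
The effect of the two subplots can be reproduced numerically with the same illustrative (not real) numbers, translated from MATLAB to Python. This sketch assumes the "contrived variable" is the Trump percentage minus the straight-Republican percentage; that definition is a guess, not something stated in this comment.

```python
# Illustrative numbers from the comment above. NOT REAL DATA.
repub_prec_fraction = [20, 30, 40, 50, 60, 70, 80]
trump_likelihood = [25, 30, 35, 40, 45, 50, 55]

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx

# Assumed definition of the contrived variable: Trump % minus straight-Republican %.
difference = [t - r for t, r in zip(trump_likelihood, repub_prec_fraction)]

print("Trump % vs Republican %: slope", slope(repub_prec_fraction, trump_likelihood))
print("Difference vs Republican %: slope", slope(repub_prec_fraction, difference))
print("Difference values:", difference)
```

Under that assumption, Trump's support rises steadily with the Republican share (slope 0.5), yet the difference falls (slope -0.5) simply because the same x value is subtracted from y. No votes need to move for the downward line to appear.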

@charlesmartin14

This should be moved to another thread

@chavenor

@MechanicalTim I do not believe that is what they are saying. I took "straight ticket" as assuming that all Republicans vote for Trump, so as a precinct gets more Republican you would expect the number to be at 0%, not down at -25%. Also, this does play into the discussion above about lower Dem turnout and trying to figure out where the votes came from.

I'll wait for the data so I can just see what they did.

Moved here. #38

@RexRookie

It seems to me that the Shiva stuff is a case of deliberately deceptive plotting.

They display plots using the following data:

  • fraction who voted "straight Republican" (but guessing this means for non-Prez races?)
  • fraction who voted for Trump

If we posit that people who are more likely to vote straight Republican are more likely to vote for Trump, then the mean percentages voting for Republican, and voting for Trump, might look something like this:

repub_prec_fraction = [20; 30; 40; 50; 60; 70; 80]; % and rough approx of "straight Rep"
trump_likelihood = [25; 30; 35; 40; 45; 50; 55];

THE ABOVE ARE NOT REAL DATA! USED FOR ILLUSTRATIVE PURPOSES ONLY!

(Also, excuse the MATLAB syntax.)

Here are two subplots:

  • Top: Plot that relationship straightforwardly
  • Bottom: Plot it using the contrived variable from the video

[Image: Shiva nonsense]

(In Shiva's plot, there is of course the random scatter of real data around those lines.)

He then claims that this shape is somehow evidence of Biden stealing votes from Trump.

I have admittedly over-simplified a bit, for the sake of making my fundamental point more directly. But I think this is at the heart of Shiva's plot. I think he is obscuring truth, not revealing it.

Shiva does other deceptive things on the plot, like adding lines to "guide the eye", which, if you ignore them, you realize do not actually follow the data. There are also edge effects on the plot, that he ignores. Finally, he also makes verbal statements that are similarly deceiving.

I rate the video 1 out of 10. Would not watch again. (Disclaimer: I only watched the first 37 minutes before writing this.)

It's exactly what happens with their plots, that's the whole story :) Well said.

@chavenor
