## Forensics of the 2016 State Duma (legislative) Elections in Russia

### Benford's law of leading digits and its application in election forensics

This class project presents an analysis of the 2016 legislative elections in the Russian Federation. The precinct-level data are analyzed using Benford's Law, a [phenomenological](http://mathworld.wolfram.com/BenfordsLaw.html) principle describing the probabilities of leading-digit occurrence in data tables.
Benford's first-digit law implies that the probability of a leading digit (d) in a table of physical or statistical data can be calculated using the [formula](http://mathworld.wolfram.com/BenfordsLaw.html):

$$
P(d) = \frac{\ln\left(1+\frac{1}{d}\right)}{\ln 10} = \log_{10}\left(1+\frac{1}{d}\right), \qquad d = 1, \dots, 9.
$$

This equation can be [generalized](http://www.statisticalconsultants.co.nz/blog/benfords-law-and-accounting-fraud-detection.html) to the probability that the n-th digit of a number equals d:

$$
P_{n}(d) = \sum_{k=10^{n-2}}^{10^{n-1}-1} \log_{10}\left(1 + \frac{1}{10k+d}\right)
$$
where n (n ≥ 2) is the position of the digit within the number and d ranges over 0 to 9.

Considering this, the probability of occurrence for each leading digit is as follows:

Digit (d) | Probability of being a leading digit(P)
-----------|-----------
1	| 0.30103
2	| 0.176091
3	| 0.124939
4	| 0.09691
5	| 0.0791812
6	| 0.0669468
7	| 0.0579919
8	| 0.0511525
9	| 0.0457575
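
The values in the table can be reproduced directly from the two formulas above. The following is a minimal sketch in plain Python; the function names `benford_first_digit` and `benford_nth_digit` are illustrative and do not come from any library.

```python
import math

def benford_first_digit(d):
    """Probability that the leading digit equals d (d = 1..9)."""
    return math.log10(1 + 1 / d)

def benford_nth_digit(d, n):
    """Probability that the n-th digit (n >= 2) equals d (d = 0..9)."""
    lo, hi = 10 ** (n - 2), 10 ** (n - 1) - 1
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, hi + 1))

first = [benford_first_digit(d) for d in range(1, 10)]
second = [benford_nth_digit(d, 2) for d in range(10)]

print([round(p, 5) for p in first])    # first-digit probabilities, d = 1..9
print([round(p, 5) for p in second])   # second-digit probabilities, d = 0..9
```

Both lists sum to one, which is a quick sanity check that the summation limits are right.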

Originally discovered by [Simon Newcomb](http://www.jstor.org/stable/2369148?origin=crossref&seq=1#page_scan_tab_contents), the 'law' was rediscovered by American physicist [Frank Benford](https://www.jstor.org/stable/984802?seq=1#page_scan_tab_contents). Benford's law has been applied to detecting [tax evasion](http://search.proquest.com/docview/211023799?pq-origsite=gscholar), [irregularities in accountancy](https://www.researchgate.net/profile/Cindy_Durtschi/publication/241401706_The_Effective_Use_of_Benford's_Law_to_Assist_in_Detecting_Fraud_in_Accounting_Data/links/54982f4a0cf2c5a7e342a59e.pdf), [fraudulent scientific data](http://www.tandfonline.com/doi/abs/10.1080/02664760601004940) and, most recently, irregularities in [elections](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.556.5055&rep=rep1&type=pdf). Although the method is a handy tool for examining electoral manipulation, it has been subject to serious criticism (see [here](https://www.researchgate.net/publication/227450786_When_Does_the_Second-Digit_Benford's_Law-Test_Signal_an_Election_Fraud_Facts_or_Misleading_Test_Results) and [here](http://pan.oxfordjournals.org/content/19/3/269.short)).

[Walter Mebane](http://www-personal.umich.edu/~wmebane/) of the University of Michigan and his colleagues have developed a comprehensive election forensics toolkit in [R](http://www.electiondataarchive.org/forensics.html). A very simple implementation of one element of this toolkit is given below.

### Why Russia

On September 18<sup>th</sup>, 2016, Russia held legislative elections. As [expected](https://www.washingtonpost.com/news/monkey-cage/wp/2016/09/16/no-you-wont-see-russians-protesting-a-putin-win-on-sunday-even-if-his-party-really-lost/), the ruling United Russia party won a comfortable [majority](http://www.reuters.com/article/us-russia-electon-idUSKCN11N0T6) in the State Duma, followed by closely associated parties such as the nationalist Liberal Democratic Party, the Communist Party of Russia and the center-left Just Russia.

![In Russia, presidents ride bears](putin_riding_bear.jpg)

Although pollsters had been predicting a [comfortable victory](https://www.bloomberg.com/view/articles/2016-07-08/russia-has-the-most-boring-election-of-2016) for Vladimir Putin's United Russia, there were many reports of election [irregularities](http://www.reuters.com/article/us-russia-electon-idUSKCN11N0T6), including ballot stuffing and so-called 'carousels'. Some cases were even recorded by the [web cameras](https://www.youtube.com/watch?v=iRqEMIdSDGE) installed at voting precincts. Moreover, the Russian data scientist Sergey Shpilkin made a series of [claims](http://podmoskovnik.livejournal.com/175574.html) arguing that election results in some constituencies were made up. Shpilkin's allegations more or less hold together when the evidence from the [parallel vote tabulation](http://www.sms-cik.org/) is added to the picture.

In the analysis below, I use scraped results of the proportional voting compiled by Shpilkin, which are freely available through [Google Drive](https://drive.google.com/drive/u/0/folders/0ByFMnUnpIlriSmZtNU4tTldkZjg). For convenience, I only use the vote counts for the ruling United Russia and the final turnout values.


In [234]:
import pandas as pd
import numpy as np
from scipy.stats import chisquare
import matplotlib.pyplot as plt
plt.rcdefaults()
import seaborn as sns


In [235]:
path="table_233_level_4.csv"
df=pd.read_csv(path, sep='\t')
colNames=["Region", "Regional_Election_Commission", "District_Election_Commission", "Precinct", "Total_voters_registered", "Total_ballots_received", "Total_ballots_for_early_voting", "Total_ballots_used_inside_the_building", "Total_ballots_used_outside_the_building", "Total_unused_ballots", "Total_ballots_in_moving_box", "Total_ballots_at_non_moving_box", "Total_invalid_ballots", "Total_valid_ballots", "Total_temporary_id_cards", "Total_temporary_id_cards_issued_at_spot", "Total_temporary_id_cards_voters", "Total_temporary_id_not_used", "Total_temporary_id_issued_by_precinct_commission", "Total_temporary_id_lost", "Total_ballots_lost", "Total_ballots_unused", "Rodina", "Communists_of_Russia", "Russian_Pensioners", "United_Russia", "Green_party", "Civic_platform", "Liberal_democratic_party", "PARNAS", "Party_of_growth", "Civil_might", "Yabloko", "Communist_party_of_Russia", "Patriots_of_Russia", "Just_Russia", "url"]
df.columns=colNames

### Data manipulation
Let's reshape the data frame into tidy format. In the final data frame, the columns "Region" and "Precinct" form the id of each observation. The column "Variable" indicates whether a value is the number of votes cast for United Russia ("United_Russia") or the total number of voters who showed up at the polling station ("Turnout").

As the dataset does not explicitly give turnout numbers, I calculate a "Turnout" column by summing the valid and invalid ballots. Next, I subset the necessary variables from the source data frame and, finally, melt it into tidy form.

The results of three polling stations were annulled; consequently, I exclude them from the analysis.
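
The reshaping described above can be illustrated on a toy data frame before it is applied to the real data. This is only a sketch: the two rows and their numbers are made up, and dropping zero-valued rows mirrors the filter used in the cell that follows.

```python
import pandas as pd

# Toy data with hypothetical numbers, for illustration only
toy = pd.DataFrame({
    'Region': ['A', 'A'],
    'Precinct': ['P1', 'P2'],
    'United_Russia': [120, 0],
    'Turnout': [300, 0],
})

# One row per (Region, Precinct, Variable) combination
tidy = pd.melt(toy, id_vars=['Region', 'Precinct'],
               var_name='Variable', value_name='Value')

# Drop zero-valued rows
tidy = tidy[tidy.Value != 0]
print(tidy)
```

Each precinct contributes one row per variable, so the wide two-row frame becomes four long rows, two of which are then removed by the zero filter.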

In [236]:
# turnout = valid + invalid ballots
df['Turnout']=df['Total_valid_ballots']+df['Total_invalid_ballots']
ruElect=df[['Region', 'Precinct', 'United_Russia', 'Turnout']]
# melt into long (tidy) form: one row per precinct and variable
ruTidy = pd.melt(ruElect, id_vars=['Region', 'Precinct'], var_name="Variable", value_name="Value")
# drop rows with a zero value
ruTidy=ruTidy[ruTidy.Value!=0]
ruTidy.head()

Index | Region | Precinct | Variable | Value
------|--------|----------|----------|------
0 | Республика Адыгея (Адыгея) | УИК №1 | United_Russia | 1796
1 | Республика Адыгея (Адыгея) | УИК №2 | United_Russia | 1290
2 | Республика Адыгея (Адыгея) | УИК №3 | United_Russia | 1984
3 | Республика Адыгея (Адыгея) | УИК №4 | United_Russia | 1441
4 | Республика Адыгея (Адыгея) | УИК №5 | United_Russia | 240


Now, subset the dataset for "United_Russia" and check the first digits of the party's votes against Benford's law:

In [237]:
unRussia=ruTidy[ruTidy['Variable'].isin(['United_Russia'])]

### Calculation:

In [238]:
# assign() returns a new data frame, which avoids pandas' SettingWithCopyWarning
unRussia = unRussia.assign(first_digit=unRussia['Value'].astype(str).str[0])
# observed count of each leading digit
tabFreq = unRussia.groupby('first_digit').size().to_frame('myCounts')


We have now saved the observed distribution of first-digit values. Compare it with the expected distribution by performing a chi-square test:

In [239]:
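probs=pd.DataFrame({'Probability':[0.30103, 0.176091, 0.124939, 0.09691, 0.0791812, 0.0669468, 0.0579919, 0.0511525, 0.0457575]})
# expected counts: Benford probabilities scaled by the hard-coded total of 95945 observations
probs['expDistr']=probs.Probability*95945
# note: scipy's signature is chisquare(f_obs, f_exp); the expected counts are passed first here,
# and the statistic reported below reflects that argument order
chisquare(probs.expDistr, tabFreq.myCounts, axis=None)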

Power_divergenceResult(statistic=4406.3860802959643, pvalue=0.0)
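
For reference, the same kind of test can be wrapped in a small helper that passes the observed counts as `f_obs` and the expected counts as `f_exp`, and that scales the expected counts by the actual sample size instead of a hard-coded total. This is a sketch under assumptions, not the code used for the results in this notebook: the helper name `first_digit_test` is hypothetical, and the data are synthetic (log-uniform over four orders of magnitude, a distribution known to follow Benford's law closely).

```python
import numpy as np
import pandas as pd
from scipy.stats import chisquare

def first_digit_test(values):
    """Chi-square test of leading digits against Benford's law."""
    digits = pd.Series(values).astype(str).str[0].astype(int)
    # Observed counts for digits 1..9 (0 where a digit never occurs)
    observed = digits.value_counts().reindex(range(1, 10), fill_value=0).to_numpy()
    # Expected counts: Benford probabilities times the sample size
    expected = np.log10(1 + 1 / np.arange(1, 10)) * observed.sum()
    return chisquare(f_obs=observed, f_exp=expected)

# Synthetic positive integers, for illustration only
rng = np.random.default_rng(0)
sample = np.floor(10 ** rng.uniform(0, 4, size=5000)).astype(int)

result = first_digit_test(sample)
print(result.statistic, result.pvalue)
```

Because the expected counts are derived from `observed.sum()`, the two vectors always have the same total, which `chisquare` requires in recent SciPy versions.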

The size of the test statistic is not surprising once we plot the expected and observed distributions:

In [240]:
loc1=(1, 3, 5, 7, 9, 11, 13, 15, 17)    # x positions of the observed bars
loc2=(2, 4, 6, 8, 10, 12, 14, 16, 18)   # x positions of the expected bars
ind=np.arange(1,19,2)
ax = plt.subplot(111)
p1=ax.bar(loc1, tabFreq.myCounts, align='center', alpha=0.4, color='b')
p2=ax.bar(loc2, probs.expDistr, align='center', alpha=0.4, color='r')
ax.set_ylabel('Count')   # the bars show counts, not percentages
ax.set_xlabel('First digits')
ax.set_title('United Russia: Observed and Expected Distribution of First Digits')
ax.set_xticks(ind+0.5)
ax.set_xticklabels(('1', '2', '3', '4', '5', '6', '7', '8', '9'))

plt.legend((p1[1], p2[1]), ('Observed', 'Expected'))
plt.show()

We can also check the second digits; this requires a different set of expected probabilities:

In [241]:
# .str[1] is NaN for one-digit values, which groupby then skips
unRussia = unRussia.assign(second_digit=unRussia['Value'].astype(str).str[1])
tabFreq = unRussia.groupby('second_digit').size().to_frame('myCounts')
# expected second-digit probabilities for digits 0..9
probs=pd.DataFrame({'Probability':[0.11968,0.11389,0.10882,0.10433,0.10031,0.09668,0.09337,0.09035,0.08757,0.085]})
probs['expDistr']=probs.Probability*95945
chisquare(probs.expDistr, tabFreq.myCounts, axis=None)


Power_divergenceResult(statistic=38.848829118871095, pvalue=1.2266507773817321e-05)

In [242]:
loc1=(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)    # x positions of the observed bars
loc2=(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)   # x positions of the expected bars
ind=np.arange(1,21,2)
ax = plt.subplot(111)
p1=ax.bar(loc1, tabFreq.myCounts, align='center', alpha=0.4, color='b')
p2=ax.bar(loc2, probs.expDistr, align='center', alpha=0.4, color='r')
ax.set_ylabel('Count')   # the bars show counts, not percentages
ax.set_xlabel('Second digits')
ax.set_title('United Russia: Observed and Expected Distribution of Second Digits')
ax.set_xticks(ind+0.5)
ax.set_xticklabels(('0', '1', '2', '3', '4', '5', '6', '7', '8', '9'))

plt.legend((p1[1], p2[1]), ('Observed', 'Expected'))
plt.show()

For the turnout figures, we follow the same steps but subset only the corresponding observations from the melted data frame:

In [243]:
# the name unRussia is reused here, now holding the turnout observations
unRussia=ruTidy[ruTidy['Variable'].isin(['Turnout'])]
unRussia = unRussia.assign(first_digit=unRussia['Value'].astype(str).str[0])
tabFreq = unRussia.groupby('first_digit').size().to_frame('myCounts')
probs=pd.DataFrame({'Probability':[0.30103, 0.176091, 0.124939, 0.09691, 0.0791812, 0.0669468, 0.0579919, 0.0511525, 0.0457575]})
probs['expDistr']=probs.Probability*95945
chitest=chisquare(probs.expDistr, tabFreq.myCounts, axis=None)
print(chitest)
loc1=(1, 3, 5, 7, 9, 11, 13, 15, 17)
loc2=(2, 4, 6, 8, 10, 12, 14, 16, 18)

ind=np.arange(1,19,2)
ax = plt.subplot(111)
p1=ax.bar(loc1, tabFreq.myCounts, align='center', alpha=0.4, color='b')
p2=ax.bar(loc2, probs.expDistr, align='center', alpha=0.4, color='r')
ax.set_ylabel('Count')   # the bars show counts, not percentages
ax.set_xlabel('First digits')
ax.set_title('Turnout: Observed and Expected Distribution of First Digits')
ax.set_xticks(ind+0.5)
ax.set_xticklabels(('1', '2', '3', '4', '5', '6', '7', '8', '9'))
plt.legend((p1[1], p2[1]), ('Observed', 'Expected'))
plt.show()


Power_divergenceResult(statistic=7324.545017089772, pvalue=0.0)


In [244]:
# .str[1] is NaN for one-digit values, which groupby then skips
unRussia = unRussia.assign(second_digit=unRussia['Value'].astype(str).str[1])
tabFreq = unRussia.groupby('second_digit').size().to_frame('myCounts')
probs=pd.DataFrame({'Probability':[0.11968,0.11389,0.10882,0.10433,0.10031,0.09668,0.09337,0.09035,0.08757,0.085]})
probs['expDistr']=probs.Probability*95945
chitest=chisquare(probs.expDistr, tabFreq.myCounts, axis=None)
print(chitest)

loc1=(1, 3, 5, 7, 9, 11, 13, 15, 17, 19)
loc2=(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
ind=np.arange(1,21,2)
ax = plt.subplot(111)
p1=ax.bar(loc1, tabFreq.myCounts, align='center', alpha=0.4, color='b')
p2=ax.bar(loc2, probs.expDistr, align='center', alpha=0.4, color='r')
ax.set_ylabel('Count')   # the bars show counts, not percentages
ax.set_xlabel('Second digits')
ax.set_title('Turnout: Observed and Expected Distribution of Second Digits')
ax.set_xticks(ind+0.5)
ax.set_xticklabels(('0', '1', '2', '3', '4', '5', '6', '7', '8', '9'))

plt.legend((p1[1], p2[1]), ('Observed', 'Expected'))
plt.show()


Power_divergenceResult(statistic=81.733160018791111, pvalue=7.309212180437401e-14)


For both turnout and the votes cast for the incumbent United Russia party, the tests of the first and second digits reveal unusual patterns. In all cases, the observed distributions differ significantly from what is expected. As mentioned above, Benford's law should be applied to elections with caution; however, the 2016 Russian elections clearly show signs of electoral malpractice and manipulation.