In [1]:
import glob
import os
import pandas as pd 
import numpy as np

### Defining the files directory and paths
This is important for the Importing phase

In [2]:
dirs = glob.glob('/home/bennour/Projects/my_repos/elections/super_final_data/*.csv')

Getting the names of the files and set them in a list, This important to use it in loop in the analysis phase

In [3]:
fnames = os.listdir('/home/bennour/Projects/my_repos/elections/super_final_data')
names = [os.path.splitext(fnames[i])[0] for i in range(27)]

### Phase : Importation
Import the files with pandas in a dictionary to loop on in the analysis
###### we can also think about importing the data from the github repo, further preprocessing is needed is done. 

In [4]:
dfs = {d: pd.read_csv(d) for d in dirs}
dfs = dict(zip(names, list(dfs.values())))

### Phase : Create a preperation function
We Have to prepare the data in 'dfs' to get the total votes for each list, however this is not the only thing that we need to do, we also have to define which depend on the quota calculation process.

In [5]:
# this function takes the dataframe sums row wise for each candidate 
# and return list name and the total votes
def prep(df):
    # act on the data frame and process it 
    df_0 = pd.concat([pd.DataFrame({'sumv':np.sum(df, axis=1)}), df['list']], axis = 1)
    return df_0

### Phase: Define and prepare  Hare quota arguments
Hare quota is the number of all votes in given city divided by the number of seats for that city. 
Based on the HQ we will create a table where we have:
- electoral quota for each list: Q = 1 means list gets 1 seat etc.
- seats collected fully by the votes quota, quota seats: QS.
- remains R from the quota: votes who didn't got any seats to the list.
- percentage P of the votes of the list from all votes.

In [6]:
def hare(df, s):
    #total votes:
    ts = np.sum(df.sumv, axis = 0)
    #hare quota:
    hq = np.round(ts/s,decimals=3)
    #hare quota per list
    df['q'] = df.sumv/hq
    #quota seats
    df['qseats'] = np.fix(df.q)
    #remains
    df['r'] = df.q - df.qseats
    #percentage
    df['p'] = df.sumv/ts
    #sort the values with the highest remains first
    df = df.sort_values('r', ascending = False)
    return df

In [7]:
hare(prep(dfs['sousse']), 10).head()

Unnamed: 0,sumv,list,q,qseats,r,p
14,102604,قائمة حركة نداء تونس,4.929093,4.0,0.929093,0.492909
9,12360,قائمة حزب آفاق تونس,0.593774,0.0,0.593774,0.059377
15,50820,قائمة حزب حركة النهضة,2.441391,2.0,0.441391,0.244139
41,8626,قائمة حزب المبادرة,0.414393,0.0,0.414393,0.041439
33,5502,قائمة الجبهة الشعبية,0.264316,0.0,0.264316,0.026432


### Phase : Prepare and perform computations for seats allocation
In this phase we will impliment the largest remains allocation method for a given dataset. We will also give the opportunity to assign a minimum percentage of representation for lists to be accorded the remained seats. 
For each list sorted with regard to it's remains, and satisfying the condition on the percentage, we add a seat, untill all lists are given one, if more seats are still not allocated we iterate again with the same order untill we have none, or untill the first conditioon is met and we repeat.

In [8]:
def min_p(df,p):
    dff = df.loc[df['p'] > p]
    return dff

In [9]:
def seats(df, s):
    df = df.reset_index()
    rs = np.int64(s - np.sum(df.qseats))
    while np.sum(df.qseats) < s:    
        for i in range(len(df)):
            df.qseats[i] = df.qseats[i] + 1  
    dff = df.loc[df.qseats > 0]
    return dff

### Phase : Combine all processing and computations
For future ease of use, testing, and debugging, it is convinient to create a function that combine all of the above. 
Let's call it results().

In [10]:
def results(df, s, p):
    return seats(min_p(hare(prep(df), s),p), s)

In [11]:
results(dfs['sousse'], 10, 0.05)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,index,sumv,list,q,qseats,r,p
0,14,102604,قائمة حركة نداء تونس,4.929093,6.0,0.929093,0.492909
1,9,12360,قائمة حزب آفاق تونس,0.593774,2.0,0.593774,0.059377
2,15,50820,قائمة حزب حركة النهضة,2.441391,4.0,0.441391,0.244139


### Phase : Get data on seats for each region
Using data from the wikipedia article on regional dispatching of parliamentary seats in Tunisia, [link]('https://ar.wikipedia.org/wiki/%D9%82%D8%A7%D8%A6%D9%85%D8%A9_%D8%A7%D9%84%D8%AF%D9%88%D8%A7%D8%A6%D8%B1_%D8%A7%D9%84%D8%A7%D9%86%D8%AA%D8%AE%D8%A7%D8%A8%D9%8A%D8%A9_%D9%81%D9%8A_%D8%AA%D9%88%D9%86%D8%B3#%D8%A7%D9%84%D8%AF%D9%88%D8%A7%D8%A6%D8%B1_%D8%A7%D9%84%D8%A7%D9%86%D8%AA%D8%AE%D8%A7%D8%A8%D9%8A%D8%A9_%D8%AF%D8%A7%D8%AE%D9%84_%D8%AA%D9%88%D9%86%D8%B3').

Ofcourse we will not be needing all the page, onl the table, the region, and seats associated.

In [12]:
sieges = pd.read_html('https://ar.wikipedia.org/wiki/%D9%82%D8%A7%D8%A6%D9%85%D8%A9_%D8%A7%D9%84%D8%AF%D9%88%D8%A7%D8%A6%D8%B1_%D8%A7%D9%84%D8%A7%D9%86%D8%AA%D8%AE%D8%A7%D8%A8%D9%8A%D8%A9_%D9%81%D9%8A_%D8%AA%D9%88%D9%86%D8%B3#%D8%A7%D9%84%D8%AF%D9%88%D8%A7%D8%A6%D8%B1_%D8%A7%D9%84%D8%A7%D9%86%D8%AA%D8%AE%D8%A7%D8%A8%D9%8A%D8%A9_%D8%AF%D8%A7%D8%AE%D9%84_%D8%AA%D9%88%D9%86%D8%B3')[1]

In [13]:
sieges.head()

Unnamed: 0,الموقع,الدائرة الانتخابية,الأماكن,المقاعد
0,تونس (199 مقعد),أريانة,ولاية أريانة,8
1,تونس (199 مقعد),باجة,ولاية باجة,6
2,تونس (199 مقعد),بن عروس,ولاية بن عروس,10
3,تونس (199 مقعد),بنزرت,ولاية بنزرت,9
4,تونس (199 مقعد),قابس,ولاية قابس,7


In [14]:
sieges = sieges.iloc[:,[1,3]]

In [15]:
# we will reorder the names, and fix the arrangemnet with regard to the seats in the table above
names.sort()
# this order vector will be of the outmost importance in later stages
order = [0,1,2,3,4,5,6,7,8,9,12,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26]
namesnew = [names[i] for i in order]

In [16]:
sieges['gov'] = namesnew
sieges = sieges.iloc[:,1:].rename(columns = {'المقاعد' : 'seats'})

In [17]:
s = dict(zip(namesnew, list(sieges.seats)))

In [18]:
s

{'ariana': 8,
 'beja': 6,
 'ben_arous': 10,
 'bizerte': 9,
 'gabes': 7,
 'gafsa': 7,
 'jendouba': 8,
 'kairouan': 9,
 'kasserine': 8,
 'kebili': 5,
 'mannouba': 7,
 'kef': 6,
 'mahdia': 8,
 'mednine': 9,
 'monastir': 9,
 'nabeul1': 7,
 'nabeul2': 6,
 'sfax1': 7,
 'sfax2': 9,
 'sidibouzid': 8,
 'siliana': 6,
 'sousse': 10,
 'tataouine': 4,
 'tozeur': 4,
 'tunis1': 9,
 'tunis2': 8,
 'zaghouan': 5}

### Phase : Constructing the results 
We will now use the table above, to loop for each region, it's associated dataset in 'dfs', pass it to the functions one by one , using also the corresponding number of seats for each specified in the column seats. 

We will associate the results into a dictionary we will call it 'fr', for final results, each key will be looped on as the name of the data set, and each value will be resulted dataset from the results function of the looped upon dfs dictionary.

In order to do this iteration, We will need to reorder the arrangement of names on which we created the 'dfs' dictionary, therefore, we need to reimport the data again in a proper manner.

In [19]:
dirs.sort()
dirsnew = [dirs[i] for i in order]

In [20]:
dfs = {d: pd.read_csv(d) for d in dirsnew}
dfs = dict(zip(namesnew, list(dfs.values())))

In [21]:
res = {n : results(dfs[n], s[n], 0) for n in namesnew}

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


### Phase : Cleaning the names of the lists.
In this section I discoverred That some names in arabic have been written in a terrible manner, this could create unconsistent results later on in the plotting, therefore we need to make the names of the lists that went to the parliament are clean.

In [22]:
winners = ['قائمة حركة نداء تونس',
           'قائمة حزب حركة النهضة',
           'قائمة حزب الاتحاد الوطني الحر',
           'قائمة الجبهة الشعبية',
           'قائمة حزب التيار الديمقراطي',
           'قائمة حزب التحالف',
           'قائمة حزب المؤتمر من أجل الجمهورية',
           'القائمة المستقلة الإقلاع',
           'قائمة حزب صوت الفلاحين',
           'قائمة  تيار المحبة',
           'قائمة حركة الشعب',
           'قائمة حزب آفاق تونس',
           'قائمة الجبهة الوطنية للإنقاذ',
           'قائمة الوفاء لمشروع الشهيد',
           'قائمة حزب المبادرة',
           'قائمة المجد الجريد' ]

In [23]:
res[namesnew[17]].list = [winners[i] for i in [1,3,4,0]]
res[namesnew[4]].list = [winners[i] for i in [1,2,6,0]]
res[namesnew[25]].list = [winners[i] for i in [1,0,11]]
res[namesnew[24]].list = [winners[i] for i in [0,4,3,1]]
res[namesnew[23]].list = [winners[i] for i in [15,0,2,1]]
res[namesnew[19]].list = [winners[i] for i in [9,3,1,12,13,0]]
res[namesnew[15]].list = [winners[i] for i in [0,1,11]]
res[namesnew[14]].list = [winners[i] for i in [1,11,3,0]]
res[namesnew[11]].list = [winners[i] for i in [0,3,2,1]]
res[namesnew[8]].list = [winners[i] for i in [1,3,2,6,0]]
res[namesnew[3]].list = [winners[i] for i in [1,5,2,0]]
res[namesnew[2]].list = [winners[i] for i in [0,1,2,3,4]]

ValueError: Length of values does not match length of index

In [None]:
pure = [res[i][['list','qseats']] for i in namesnew]

In [None]:
w = pd.concat(pure)
l = w['list'].drop_duplicates()

In [None]:
order = {i : w[w['list'] == i] for i in l}

In [None]:
order2 = {i : order[i].drop('list',axis = 1).sum() for i in l}
win = pd.DataFrame(order2).transpose().reset_index()
win.columns = ['list','seats']

In [None]:
win.to_csv("/home/bennour/Projects/elections/seats.csv")

In [None]:
import plotly.express as px

In [None]:
pie = px.pie(win ,values='seats', names = 'list')
pie.show()