# 01 Importing data
To illustrate what we have done so far, the parsing below is run on only a fraction of the overall dataset.

In [1]:
import json
import os
import re
from lxml import etree
from datetime import datetime

In [93]:
project_path = '/Users/robin/GIT/ADA/ADA2017_GroupWork/Project/'
# path to JDG .xml's
path = '/Users/robin/GIT/ADA/JDG/'

Defining helper functions to be used for data parsing:

In [3]:
def month_dates(start, end):
    """
    Returns a list of 'YYYY/MM' strings, one per month from
    start to end (inclusive), giving the paths of the files to process
    """
    f = lambda date: date.month + 12 * date.year

    res = []
    for tot_m in range(f(start)-1, f(end)):
        y, m = divmod(tot_m, 12)
        res.append(str(y) + '/' + '%02d' % (m+1))
    
    return res
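
As a quick sanity check of the month arithmetic, the sketch below restates `month_dates` so that it runs on its own, and shows the relative paths it produces for a range that crosses a year boundary. The example dates are arbitrary.

```python
from datetime import datetime

def month_dates(start, end):
    """Return 'YYYY/MM' strings for every month from start to end, inclusive."""
    def month_index(date):
        # Months counted from year 0: January of year y is 12 * y + 1
        return date.month + 12 * date.year

    res = []
    for tot_m in range(month_index(start) - 1, month_index(end)):
        y, m = divmod(tot_m, 12)
        res.append('%d/%02d' % (y, m + 1))
    return res

example = month_dates(datetime(1997, 11, 15), datetime(1998, 2, 28))
print(example)  # ['1997/11', '1997/12', '1998/01', '1998/02']
```

The day of the month is ignored here; filtering by exact date happens later, article by article.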

In [4]:
def get_date(article):
    """
    This method returns the date of the article
    """
    str_date = article.find('entity').find('meta').find('issue_date').text
    return datetime.strptime(str_date, '%d/%m/%Y')

In [5]:
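def get_articles_in_file(file, start_date, end_date):
    """
    Returns the articles of a parsed file dated between start_date
    and end_date, each as a string prefixed with its date
    """
    articles = []
    for article in file.iter('article'):
        if article.find('entity') is not None:
            a = ''
            date = get_date(article)
            if start_date <= date <= end_date:
                for entity in article.iter('entity'):
                    # findtext returns None when 'full_text' is missing
                    a += (entity.findtext('full_text') or '') + ' '
                articles.append(date.strftime('%d/%m/%Y') + ' ' + a)
    return articles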
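
The helpers above assume a particular XML layout, inferred here only from the tags they query: each `article` holds one or more `entity` elements, each with a `meta` block (`issue_date`, `box`) and a `full_text`. The sketch below builds a tiny made-up document with that shape and extracts a dated article from it. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies; the sample text, the `root` wrapper tag and the function name `dated_articles` are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SAMPLE = """
<root>
  <article>
    <entity>
      <meta><issue_date>05/01/1990</issue_date><box>1 2 3 4</box></meta>
      <full_text>Premier bloc.</full_text>
    </entity>
    <entity>
      <meta><issue_date>05/01/1990</issue_date><box>5 6 7 8</box></meta>
      <full_text>Second bloc.</full_text>
    </entity>
  </article>
  <article/>
</root>
"""

def dated_articles(root):
    """Join the text of all entities of each article, prefixed by its date."""
    out = []
    for article in root.iter('article'):
        first = article.find('entity')
        if first is None:  # skip articles without any entity
            continue
        date = datetime.strptime(first.findtext('meta/issue_date'), '%d/%m/%Y')
        # findtext returns None for a missing tag, hence the `or ''`
        text = ' '.join(e.findtext('full_text') or '' for e in article.iter('entity'))
        out.append(date.strftime('%d/%m/%Y') + ' ' + text)
    return out

result = dated_articles(ET.fromstring(SAMPLE.strip()))
print(result)
```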

In [6]:
def get_articles(path, start_date, end_date):
    articles = []
    for m_date in month_dates(start_date, end_date):
        try:
            file = etree.parse(path + m_date + '.xml')
            articles.append(get_articles_in_file(file, start_date, end_date))
        except (FileNotFoundError, IOError):
            pass
    return [a for file in articles for a in file]  

In [7]:
def get_entity_text(file, box_id):
    res = None
    for article in file.iter('article'):
        if article.find('entity') is not None:
            date = get_date(article)
            for entity in article.iter('entity'):
                if box_id == entity.find('meta').find('box').text:
                    res = date.strftime('%d/%m/%Y') + ' ' + entity.findtext('full_text')
                    break
    return res

We only parse a fraction of the data here, delimited by `start_date` and `end_date`.

Parsing all files using previous helper functions and the lxml parser:

In [8]:
start_date = datetime(1990, 1, 1)
end_date = datetime(1998, 2, 28)

In [9]:
articles = get_articles(path, start_date, end_date)

In [10]:
len(articles)

358455

## Data cleaning and filtering
### Lemmatisation
We use the NLP library `spacy` to turn articles into more usable data, relying heavily on lemmatisation: each word is reduced to a 'root' form (its lemma) by removing conjugation, affixes and other inflections. This makes words much easier to compare between articles, since all inflected forms of a word map to the same lemma. For example, the words:

> voter, votation, votons, vote, voté

are all mapped to the same 'root' word, **voter**.
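
To make the idea concrete without loading a language model, here is a toy sketch that uses a hand-written lookup table in place of spaCy's lemmatiser. The table `TOY_LEMMAS` and the function `toy_lemmatise` are hypothetical and cover only a few inflected forms of *voter*; a real lemmatiser derives such mappings from its own rules and data.

```python
# Hand-written stand-in for a lemmatiser: inflected form -> lemma
TOY_LEMMAS = {
    'votons': 'voter',
    'vote': 'voter',
    'voté': 'voter',
    'votent': 'voter',
}

def toy_lemmatise(text):
    """Lowercase, split on whitespace and map each token to its lemma if known."""
    return [TOY_LEMMAS.get(tok, tok) for tok in text.lower().split()]

a = toy_lemmatise('Nous votons dimanche')
b = toy_lemmatise('Ils ont voté dimanche')

# Both sentences now share the token 'voter', which makes them comparable
shared = set(a) & set(b)
print(a, b, shared)
```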

### Naive selection
Next, we filter out articles that are not clearly about voting. We could have run a first round of clustering to identify topics and kept only the 'voting' cluster. Instead, we combine lemmatisation with 'naive selection': an article is kept only if it contains at least one keyword from a list related to voting. The problem with clustering at this stage is precision. It would mean two rounds of clustering, one to identify articles on voting and another to identify the topics of those articles. We would rather sacrifice recall for precision early on, so that every selected article is definitely about voting. In initial trials, non-voting articles that reached the later round of clustering threw it off. Furthermore, naive selection still leaves us with a large number of articles, several hundred per year, which we judged sufficient for our needs.

For now, we run the naive selection before lemmatisation for performance reasons. Running the selection on lemmatised articles would likely yield more articles without reducing precision, since inflected forms of the keywords would then match as well.

In [11]:
import spacy # NLP library
import fr_core_news_sm # Model to be used with spacy
import enchant # spellchecking

In [12]:
nlp = fr_core_news_sm.load()

In [13]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """    
    return token.is_punct or token.is_space

In [14]:
# build the dictionary once; creating it on every call is slow
french_dict = enchant.Dict('fr_FR')

def is_french(word):
    """
    helper function to eliminate tokens that
    are not French words.
    """
    return french_dict.check(word)

In [30]:
def lemmatized_corpus(corpus):
    """
    generator function to use spaCy to parse articles,
    lemmatize the text, and yield (date, lemmas) pairs
    """
    # n_threads is accepted by spaCy 2.0; newer releases deprecate or drop it
    for parsed_article in nlp.pipe(corpus,
                                   batch_size=100, n_threads=5):
        # the date is the first token of each article
        date = parsed_article[0].text

        yield (date, ' '.join([token.lemma_ for token in parsed_article
                             if not punct_space(token) and is_french(token.text)
                                and not token.is_stop and not token.is_digit
                                and not token.like_num]))

In [16]:
def corpus_votation(articles, lemmas):
    votations = []
    for article in articles:
        if any(lemma in article for lemma in lemmas): 
            votations.append(article)
    return votations

In [17]:
def corpus_votation_bis(articles, lemmas):
    votations = []
    for article in articles:
        if any(lemma in article.replace(' ', '') for lemma in lemmas):
            votations.append(article)
    return votations

This is the naive selection part, where all articles that do not contain at least one of the keywords are filtered out:

In [None]:
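def article_selection(articles, keywords):
    """
    keeps the articles containing at least one of the
    keywords as a whitespace-delimited token
    """
    articles_votation = []
    for art in articles:
        tokens = set(art.split())
        if any(word in tokens for word in keywords):
            articles_votation.append(art)
    return articles_votation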
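
The selection helpers differ in how strictly they match: `corpus_votation` and `corpus_votation_bis` use substring matching, while `article_selection` requires an exact whitespace-delimited token. The sketch below contrasts the two behaviours on invented sentences; the names `substring_match` and `token_match` are illustrative only.

```python
def substring_match(article, keywords):
    """True if any keyword occurs anywhere in the text (as in corpus_votation)."""
    return any(k in article for k in keywords)

def token_match(article, keywords):
    """True if any keyword equals a whitespace-delimited token (as in article_selection)."""
    tokens = set(article.split())
    return any(k in tokens for k in keywords)

keywords = ['votation', 'referendum']
samples = [
    'La votation aura lieu en juin',   # exact token
    'Le calendrier des votations',     # plural form
    'Le résultat de la votation.',     # token followed by punctuation
    'Si le référendum aboutit',        # accented spelling
    'Le marché du dimanche',           # unrelated
]
results = [(substring_match(s, keywords), token_match(s, keywords)) for s in samples]
for s, r in zip(samples, results):
    print(r, s)
```

Exact token matching is the stricter of the two, which fits the stated preference for precision over recall: it skips plural and punctuated forms. Neither variant matches the accented spelling 'référendum' against the unaccented keyword 'referendum', which may be worth checking against the real data.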

In [91]:
# Naive selection (First Filtering)
lems = ['votation', 'referendum'] # todo: try adding 'initiative'

#articles_votation_third = corpus_votation_bis(articles, lemmas)
#articles_votation_bis = corpus_votation(articles, lemmas)

articles_votation = article_selection(articles, lems)

In [19]:
len(articles_votation)

2561

Here all articles are lemmatised. This step is relatively expensive in computation time. It could be parallelised in the future by splitting the set of articles into groups and assigning one core to lemmatise each group.
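
A minimal sketch of that idea follows, using `concurrent.futures` from the standard library. The helpers `split_into_groups` and `map_in_groups`, and the placeholder `fake_lemmatise`, are hypothetical names introduced for this example; they are not part of the notebook. In a real run each worker process would presumably need to load its own spaCy model, which is not shown here.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from math import ceil

def split_into_groups(items, n_groups):
    """Split a list into at most n_groups contiguous chunks, preserving order."""
    size = max(1, ceil(len(items) / n_groups))
    return [items[i:i + size] for i in range(0, len(items), size)]

def map_in_groups(items, worker, n_groups, executor_cls=ProcessPoolExecutor):
    """
    Apply worker (a function taking a list and returning a list) to each
    group in its own pool worker, then flatten the results in input order.
    """
    groups = split_into_groups(items, n_groups)
    if not groups:
        return []
    with executor_cls(max_workers=len(groups)) as pool:
        # Executor.map yields results in the order of the inputs
        results = pool.map(worker, groups)
        return [item for group in results for item in group]

def fake_lemmatise(group):
    # Placeholder for the real per-group lemmatisation
    return [text.lower() for text in group]

# Threads are used here only to keep the demo lightweight; with
# ProcessPoolExecutor the worker must be a picklable module-level function.
demo = map_in_groups(['A b', 'C d', 'E f', 'G h', 'I j'], fake_lemmatise, 2,
                     executor_cls=ThreadPoolExecutor)
print(demo)
```

Contiguous chunks keep the output aligned with the input order, so the dates and texts stay paired after the groups are merged.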

In [31]:
%%time
# %%time is a cell magic and must be the first line of the cell
# todo: parallelize and/or speed-up lemmatization
if 0 == 1:
    # Time consuming !!
    # caution: this rebinds the name lemmatized_corpus from the
    # generator function to the list of its results
    lemmatized_corpus = list(lemmatized_corpus(articles_votation))

    # retrieve dates
    dates = [pair[0] for pair in lemmatized_corpus]

    # retrieve articles
    corpus = [pair[1] for pair in lemmatized_corpus]

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 4.53 µs


Dumping lemmatized articles to text:

In [32]:
len(lemmatized_corpus)

2561

In [None]:
if 0 == 1:
    with open(os.path.join(project_path, 'lemmatized articles 1990-1998.txt'), 'w') as file:
        for article in corpus:
            file.write(article + '\n')

In [94]:
if 1 == 1:
    with open(os.path.join(project_path, 'lemmatized articles 1990-1998.json'), 'w') as file:
        json.dump(lemmatized_corpus, file)

In [30]:
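with open(os.path.join(project_path, 'lemmatized articles 1990-1998.json'), 'r') as file:
    lemmatized_corpus = json.load(file)
# rebuild the dates and texts used in the cells below
dates, corpus = map(list, zip(*lemmatized_corpus))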

In [None]:
if 0 == 1:
    # check output of lemmatizer (lemmatized_corpus) 
    file = etree.parse('/home/mbanga/Desktop/JDG/1990/01.xml')
    box_id = '24 123 1446 2167'

    original_text = [get_entity_text(file, box_id)]

    for lemmatized in lemmatized_corpus(original_text):
        print(lemmatized[1], '\n')
    print(original_text)

# 02 Unsupervised ML: LDA
The next step is to identify the topics of the articles. This is a typical unsupervised ML task: we want to assign labels that differentiate the members of a dataset (in this case, articles) without having any labelled examples to learn from.

In [None]:
if 0 == 1:
    # check naive selection
    file = etree.parse('/home/mbanga/Desktop/JDG/1990/01.xml')
    box_id = '50 163 1090 888'

    original_text = [get_entity_text(file, box_id)]
    lemmas = ['vote', 'voter', 'votation', 'referendum']
    res = corpus_votation(original_text, lemmas)

# Filtering articles about votations

> Assumption: The subject of a votation is most likely to be found in
the neighbourhood of the terms 'votation' or 'referendum' in the article. 
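
This assumption can be sketched as a small helper that returns each sentence containing the keyword together with its neighbours, clamping the window at both ends of the article. The function name `keyword_context`, its `window` parameter and the sample text are illustrative, not part of the notebook.

```python
def keyword_context(text, word, window=1):
    """Return, for each sentence containing word, that sentence plus
    `window` neighbouring sentences on each side."""
    sentences = text.split('.')
    contexts = []
    for i, sentence in enumerate(sentences):
        if word in sentence:
            lo = max(0, i - window)                    # clamp at the start
            hi = min(len(sentences), i + window + 1)   # clamp at the end
            contexts.append('.'.join(sentences[lo:hi]).strip())
    return contexts

text = ('Le Conseil a tranché. La votation aura lieu en juin. '
        'Les partis font campagne. Rien à signaler.')
ctx = keyword_context(text, 'votation')
print(ctx)
```

Splitting on '.' is a crude sentence boundary; abbreviations and OCR noise in the articles will produce some odd fragments.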

In [153]:
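# print each phrase containing the searched term, with its neighbours
word = 'votation'
for j, article in enumerate(articles_votation[:10]):
    date = re.findall(r'^([^\s]+)', article)
    print('article', (j+1), ': ', date)

    phrases = article.split('.')
    for i, phrase in enumerate(phrases):
        if word in phrase:
            # guard the boundaries: phrases[i-1] wraps around when i == 0
            # and phrases[i+1] raises IndexError on the last phrase
            before = phrases[i-1] if i > 0 else ''
            after = phrases[i+1] if i + 1 < len(phrases) else ''
            print(' {:}'.format(before + phrase + after))
    print('\n')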

article 1 :  ['01/01/1990']
  Il entend montrer à tous ceux qui se sont exprimés pour l'armée ou qui, pour des raisons personnelles, ont soutenu l'initiative, que la confiance placée en l'armée est méritée Après une campagne qui s'est déroulée sans coups au-dessous de la ceinture, -les minoritaires devraient, selon lui, se comporter en démocrates, c'est-à-dire accepter Je résultat de la votation et laisser travailler les responsables


article 2 :  ['05/01/1990']
  Or, au niveau du principe, qui recouvre la reconnaissance du besoin de la ligneet le choix d'un traçé, ces auiorisations ont été données »,, Le présidentd'EOS relève aussi, qu'après l'échec, en votation populaire, de l'initiative « Sauver -la Côte », et les recours qui ont suivi, cette autorisa- tion dep ~ incipe a été confirmée par le ConseiI :'fép £ ~ t, i < 1 L £ st ab


article 3 :  ['06/01/1990']
  Les Genevois devront donc dire si, oui ou non, ils acceptent le projet tel qu'il a été voté le 10 octobre dernier par le Co

In [137]:
word = 'votation'

for j, article in enumerate(articles_votation[-10:]):
    print('article', (j+1), ':')
    for phrase in article.split('.'):
        if word in phrase:
            print('  ', phrase)
    print('\n')

article 1 :
   14/02/1998 La votation sur la taxe poids lourds liée aux prestations aura lieu le 27 septembre si le référendum aboutit Lors des traditionnels entretiens de Watteville, les quatre partis gouvernementaux ont également déclaré qu'il était hors de question de développer des prestations sociales sans en assurer le financement
    D'ici fin 1999,24 objets pourraient être soumis à votation
    Les quatre partis gouvernementaux (PS, PRD, PDC, UDC) ont été informés vendredi du calendrier des prochaines votations fédérales lors des entretiens de Watteville avec le Conseil fédéral
    « Oui à l'Europe » Le Conseil fédéral n'a cependant pas pu préciser aux partis la date fixée pour la votation sur l'initiative « Oui à l'Europe », qui demande des négociations d'adhésion immédiates à l'UE


article 2 :
    La votation devrait se dérouler au mois de juin prochain


article 3 :
   14/02/1998 Génie génétique Les libéraux disent non à l'initiative L'assemblée des délégués du Parti libéra

In [108]:
articles_votation[0]

'01/01/1990 ARMÉE Heinz Haesler : cc L\'armée doit vivre avec son. temps » Les projets du nouveau chef de l\'état-major général à l\'aube de son entrée en fonction ". (A TS).-L\'armée aussi doit vivre avec son temps. estime le commandant de corps Heinz llaesler, 59 ans, qui deviendra lé Ier [anvter chef de l\'êtat-majer gênêrabe Cela signifie éliminer toutes les casseroles qui nous apportent peu mais nous causent beaucoup de tort », a-t-il expliqué vendredi à l\'Agence télégraphique suisse (ATS)., Lepoint fortde ma nouvelle activité sera la mis ~ en œuvre du projet « Armée 95 » pour laquelle une décennie ne sera pas de trop, affirme Heinz Haesle. r. "M ~ isparallêlement, il souhaite prendre toute une série de mesures â\'court, moyen et long terme\'qui devraient permettre à l\'armée de « vivre avec son temps ». Place aux démocrates Le nouveau chef Q ~ l\'état-major général est conscient du climat particulier créé par le résultatétonnant de l\'initiative pour la, suppression de l\'armée.

# Latent Dirichlet Allocation

In [34]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora import Dictionary, MmCorpus

import pyLDAvis
import pyLDAvis.gensim
import warnings
#import cPickle as pickle

In [35]:
# learn the dictionary by iterating over all of the articles
dico = Dictionary([article.split() for article in corpus])

# filter out tokens that are too common (present in more than
# 40% of the articles); no_below=0 disables the rare-token threshold
dico.filter_extremes(no_below=0, no_above=0.4)

# reassign integer ids
dico.compactify()
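
The effect of the two thresholds can be illustrated in plain Python. This is a sketch of the document-frequency idea only, not gensim's implementation; the function `filter_by_doc_frequency` and the toy documents are made up for the example.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=0, no_above=0.4):
    """Keep tokens present in at least no_below documents and in at most
    a fraction no_above of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    'conseil fédéral initiative armée'.split(),
    'conseil fédéral taxe poids'.split(),
    'conseil canton traversée rade'.split(),
    'conseil initiative europe'.split(),
    'conseil parti libéral'.split(),
]
kept = filter_by_doc_frequency(docs)
print(sorted(kept))
```

Here 'conseil' appears in every document and is dropped, while tokens that appear only once are kept because `no_below=0` sets no lower bound.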

In [36]:
def bow_generator(corpus):
    """
    generator function to iterate over the articles
    and yield a bag-of-words representation
    """
    for article in corpus:
        yield dico.doc2bow(article.split())

In [37]:
# generate bag-of-words representations for
# all articles and save them as a matrix
MmCorpus.serialize(os.path.join(project_path, 'corpus.mm'),
                   bow_generator(corpus))

bow_corpus = MmCorpus(os.path.join(project_path, 'corpus.mm'))
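
For readers unfamiliar with the bag-of-words format, each article becomes a list of `(token_id, count)` pairs. The sketch below mimics that representation in plain Python; `build_vocab` and `to_bow` are hypothetical stand-ins written for this example and do not call gensim.

```python
from collections import Counter

def build_vocab(docs):
    """Assign a stable integer id to every distinct token."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

docs = ['votation taxe poids taxe'.split(), 'votation europe'.split()]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bow)
```

Word order is discarded; only the counts per token survive, which is all that LDA uses.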

In [38]:
lda_model_filepath = os.path.join(project_path, 'lda_model_all')

In [60]:
if 1 == 1:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')

        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(bow_corpus,
                           num_topics=100,
                           id2word=dico,
                           workers=5)
        
        lda.save(lda_model_filepath)

# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

In [61]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
    
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
    
    for term, frequency in lda.show_topic(topic_number, topn):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [62]:
explore_topic(topic_number=11, topn=10)

term                 frequency

million              0.007
genève               0.006
demander             0.004
public               0.004
page                 0.003
franc                0.003
traverser            0.003
financier            0.003
voir                 0.003
taxer                0.003


In [104]:
# The goal is to find all documents related to the same topic
def articles_topic(lda, bow_corpus, corpus, topic):
    """
    return the list of articles whose most
    probable topic is the given topic.
    """
    assert len(bow_corpus) == len(corpus)
    nb_topics = len(lda.get_topics())

    documents = []
    if 0 <= topic < nb_topics:
        for bow_article, article in zip(bow_corpus, corpus):
            dist = lda.get_document_topics(bow_article, minimum_probability=0)
            dist = [p[1] for p in dist]
            idx_max = dist.index(max(dist))
            if idx_max == topic:
                documents.append(article)

    return documents
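
The selection rule used above (keep an article when the requested topic is its most probable one) can be checked on invented numbers without a trained model. The names `dominant_topic` and `select_by_topic` and the distributions below are assumptions made for this sketch.

```python
def dominant_topic(topic_dist):
    """Return the id of the most probable topic in a list of (id, prob) pairs."""
    return max(topic_dist, key=lambda pair: pair[1])[0]

def select_by_topic(topic_dists, documents, topic):
    """Keep the documents whose dominant topic equals `topic`."""
    return [doc for dist, doc in zip(topic_dists, documents)
            if dominant_topic(dist) == topic]

# Invented distributions over three topics for three documents
topic_dists = [
    [(0, 0.1), (1, 0.7), (2, 0.2)],
    [(0, 0.6), (1, 0.3), (2, 0.1)],
    [(0, 0.2), (1, 0.5), (2, 0.3)],
]
documents = ['doc A', 'doc B', 'doc C']
selected = select_by_topic(topic_dists, documents, 1)
print(selected)
```

With many topics the dominant one can still have a low probability, so a minimum-probability threshold might be a useful refinement.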

In [105]:
docs = articles_topic(lda, bow_corpus, articles_votation, 65)

In [103]:
docs[4]

"19/11/1991 TRAITÉ EEE Comité directeur d e ruSAM oppos é Le comité directeur de l'Union suisse des arts et des métiers (USAM) a demandé à son organe suprême, la Chambre suisse des arts et métiers, de s'opposer au Traité sur l'Espace économique européen (EEE). L'EEE entraînera le démantèlement des droits des citoyens, des cantons et du Parlement, estime l'USAM dans un communiqué publié lundi. Ainsi, les droits de référendum et d'initiative seront limités. En outre, le Traité n'est pas considéré comme un but en soi mais comme une étape vers l'adhésion à la Communauté européenne (CE). L'USAM demande en conséquence au Conseil fédéral de faire connaître la date prévue pour le dépôt d'une demande d'adhésion à la CE. Les citoyens se prononceront ainsi en toute connaissance de cause lors de la votation sur l'EEE. (ATS) "

In [102]:
type(lda)

gensim.models.ldamulticore.LdaMulticore

In [66]:
if 1 == 1:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, bow_corpus, dico)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  topic_term_dists = topic_term_dists.ix[topic_order]


In [67]:
pyLDAvis.display(LDAvis_prepared)

# Research questions and updated goals
Now that we have gotten to grips with the data and started processing it, we have a clearer idea of how to answer our original research questions, and of how our goals and methods need to change.

> Do we have the freedom to vote on any question we wish?

In retrospect this question is very general, and serves more as an introduction to the more specific subsequent questions we ask. It doesn't have a clear yes or no answer, but is most closely related to the question concerning direct democracy.

> Do the same votations, or topics, keep coming up? Is this throughout history or during specific periods?

To answer this question we would first perform LDA on the entire set of articles filtered with our votation keywords. We would then compare the date distributions of the articles in the different clusters. These distributions should show whether a topic was voted on only during a particular period of the last 200 years, or whether it regularly reappears.
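
A possible shape for that comparison is sketched below, assuming each article has already been given a topic label and keeps the `'%d/%m/%Y'` date prefix used earlier. The function `yearly_counts_by_topic` and the labelled pairs are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

def yearly_counts_by_topic(labelled_articles):
    """Map topic id -> Counter of publication years.
    Expects (date_string, topic_id) pairs with dates in '%d/%m/%Y' format."""
    counts = defaultdict(Counter)
    for date_str, topic in labelled_articles:
        year = datetime.strptime(date_str, '%d/%m/%Y').year
        counts[topic][year] += 1
    return counts

labelled = [
    ('01/01/1990', 3), ('05/06/1990', 3), ('19/11/1991', 7),
    ('14/02/1998', 3), ('14/02/1998', 7), ('20/03/1998', 7),
]
counts = yearly_counts_by_topic(labelled)
for topic in sorted(counts):
    print(topic, sorted(counts[topic].items()))
```

A topic whose counts cluster in a few adjacent years would suggest a one-off debate, while counts spread over distant years would suggest a recurring one.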

>Are the results of these repeated votations changing throughout time?

Given the workload and time pressure, we have decided to drop this question. Answering it accurately would require several additional layers of analysis. We would first have to identify the actual votations that (potentially multiple) articles refer to. Automatically determining whether a vote was accepted or rejected is a separate challenge in its own right.

>Can we link big changes in technological and societal norms to previous votations?

The best way we have thought of to tackle this question involves first choosing several important changes in technology and society. We may then try to correlate these technology and society changes to big changes in the topics of votations, for example by seeing if a new votation topic concerning cars and roads appeared around the same time as the first industrially produced affordable cars.

>How has direct democracy been used? Is it increasing or decreasing? What kind of votations is it used for?

To answer this question we plan to perform LDA on our entire dataset to get an initial view of votation topics. We would then compare this with LDA on a subset of the data filtered with keywords relating only to the direct-democracy 'initiatives populaires'. A first step is simply to compare the size of these two sets over time, to see whether 'initiatives populaires' become more or less frequent. By comparing which topics are missing or under-represented in one set, we may also draw conclusions about how direct democracy has been used, specifically on which topics or areas.
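
The size comparison over time could start from something as simple as the yearly share of articles that fall in the 'initiatives populaires' subset. The sketch below assumes the publication year of every article in both sets is available; `yearly_share` and the numbers are made up.

```python
from collections import Counter

def yearly_share(all_years, subset_years):
    """Fraction of articles per year that belong to the subset."""
    total = Counter(all_years)
    subset = Counter(subset_years)
    return {year: subset[year] / total[year] for year in sorted(total)}

all_years = [1990, 1990, 1990, 1990, 1991, 1991, 1998, 1998]
initiative_years = [1990, 1991, 1998]
share = yearly_share(all_years, initiative_years)
print(share)
```

Using a share rather than a raw count corrects for years in which the newspaper simply published more articles.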

### Dropped goals
* Extending our data with the Swiss tweets dataset. Again, given the time pressure, we think it best to focus on the news set first and foremost.
* Votation results. As discussed above, this question is more complex, so we think it better to drop it.
* Spark and cluster processing. Since our initial dataset is not huge, and we discard a large proportion of it during naive filtering, we decided that the added overhead of learning Spark wasn't worth it.