[topic modelling](https://medium.com/@osas.usen/topic-extraction-from-tweets-using-lda-a997e4eb0985)
[An oireachtas debate](https://data.oireachtas.ie/akn/ie/debateRecord/dail/2020-07-22/debate/mul@/main.xml)

In [1]:
import urllib.request
import json
import pandas as pd
import re
from nltk import PorterStemmer
from nltk.corpus import stopwords
from nltk import word_tokenize

# Questions

In [39]:
def read_from_url_and_store(url, cache_file_name='cache.txt', limit=50):
    contents = urllib.request.urlopen(url.format(f'&limit={limit}'))
    code = contents.getcode()
    if code!=200:
        print(f'problem with load {code}')
    else:
        response = contents.read()
        with open(cache_file_name, 'wt', encoding='utf-8') as f:
            f.write(response.decode('utf-8'))
            
read_from_url_and_store('https://api.oireachtas.ie/v1/questions?\
date_start=1900-01-01&date_end=2099-01-01&qtype=written{}', 'questions.json', limit=100)

In [40]:
def get_from_cache_as_json(cache_file_name='cache.txt'):
    with open(cache_file_name, 'rt', encoding='utf-8') as f:
        return json.loads(f.read())
    
data = get_from_cache_as_json('questions.json')
print(len(data['results']))

100


In [75]:
for_pandas = []
for result in data['results']:
    row = result['question']
    for_pandas.append({'date': row['date'], 'showAs': row['showAs'], 'by': row['by']['showAs']})
questions = pd.DataFrame(for_pandas)

longest_question = max(questions['showAs'].apply(lambda x: len(x)))
pd.options.display.max_colwidth = longest_question
pd.options.display.max_rows = 100
questions.head()

Unnamed: 0,date,showAs,by
0,2020-10-20,"33. Deputy Emer Higgins asked the Minister for Children, Disability, Equality and Integration the details of the funding provided to crèches that was conditional on them not increasing their prices; the way in which the matter is being monitored; and if he will make a statement on the matter. [25943/20]",Emer Higgins
1,2020-10-20,"36. Deputy Violet-Anne Wynne asked the Minister for Children, Disability, Equality and Integration if he will carry out a review into the funding model for family resource centres and other community services under the remit of Tusla. [31356/20]",Violet-Anne Wynne
2,2020-10-20,"37. Deputy Catherine Connolly asked the Minister for Children, Disability, Equality and Integration the date on which the Tusla review into the provision of safe emergency accommodation for victims of domestic violence will be published; the analysis carried out by his Department and-or Tusla into the impact of the Covid-19 pandemic on the number of emergency accommodation spaces available for victims of domestic violence; and if he will make a statement on the matter. [30631/20]",Catherine Connolly
3,2020-10-20,"38. Deputy Denis Naughten asked the Minister for Children, Disability, Equality and Integration the steps he plans to take to improve the pay and conditions of childcare workers; and if he will make a statement on the matter. [30625/20]",Denis Naughten
4,2020-10-20,"39. Deputy Gary Gannon asked the Minister for Children, Disability, Equality and Integration his position and actions either taken by his Department or scheduled for the future on archival infrastructure to ensure truth for mother and baby homes. [31365/20]",Gary Gannon


In [77]:
questions['date'].unique()

array(['2020-10-20', '2020-10-15'], dtype=object)

In [62]:
# strip the leading number
# and the trailing number in []
# and replace the questioner by XXX in all questions
questions['showAs'] = questions['showAs'].apply(lambda x: x.lower())
questions['showAs'] = questions['showAs'].apply(lambda x: re.sub('^ *[0-9]+\.', '', x))
questions['showAs'] = questions['showAs'].apply(lambda x: re.sub('\[[0-9/]+\]*', '', x))

def strip_names(row):
    names = row['by'].split()
    names = [a.lower() for a in names]
    for name in names:
        row['showAs'] = re.sub(name, '', row['showAs'].lower())
    return row['showAs']

questions['showAs'] = questions.apply(lambda row: strip_names(row), axis=1)

questions['showAs'] = questions.apply(lambda row: re.sub(row['by'] if row['by'] else 'YYY', 'XXX', row['showAs']), axis=1)

p = PorterStemmer()

stop_words = stopwords.words('english')

my_stop_words = ['minister', 'deputy', 'make', 'statement', 'matter']
stemmed_stop_words = []
stemmed_stop_words += [p.stem(a) for a in my_stop_words]
my_stop_words+=stemmed_stop_words
my_stop_words+=stop_words

questions['showAs'] = questions['showAs'].apply(lambda x: ' '.join([a for a in word_tokenize(x) if a not in my_stop_words]))
questions['showAs'] = questions['showAs'].apply(lambda x: ' '.join([p.stem(a) for a in x.split()]))
questions['showAs'] = questions['showAs'].apply(lambda x: re.sub('[.;,()]', '', x))
questions['showAs'] = questions['showAs'].apply(lambda x: re.sub('\s+', ' ', x))


questions.head()

Unnamed: 0,date,showAs,by
0,2020-10-20,ask children disabl equal integr detail fund provid crèche condit increas price way monitor,Emer Higgins
1,2020-10-20,ask children disabl equal integr carri review fund model famili resourc centr commun servic remit tusla,Violet-Anne Wynne
2,2020-10-20,ask children disabl equal integr date tusla review provis safe emerg accommod victim domest violenc publish analysi carri depart and-or tusla impact covid-19 pandem number emerg accommod space avail victim domest violenc,Catherine Connolly
3,2020-10-20,ask children disabl equal integr step plan take improv pay condit childcar worker,Denis Naughten
4,2020-10-20,ask children disabl equal integr posit action either taken depart schedul futur archiv infrastructur ensur truth mother babi home,Gary Gannon


In [63]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
c = TfidfVectorizer()

In [64]:
transformed = c.fit_transform(questions['showAs']).toarray()

In [65]:
feature_names = c.get_feature_names()

In [66]:
import numpy as np
[feature_names[x] for x in [np.argmax(a) for a in transformed]][:5]

['monitor', 'centr', 'domest', 'improv', 'archiv']

In [67]:
feature_names[:5]

['10', '13', '159', '189', '19']

In [68]:
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 20
lda = LatentDirichletAllocation(n_components=n_topics)
lda.fit(transformed)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=None,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=0)

In [69]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [70]:
print_top_words(lda, feature_names, 10)

Topic #0: inspector commun function critic hybrid subject proceed blend model centr
Topic #1: minor risk exist view sector whot work postgradu condit would
Topic #2: develop polici statu literaci career pathway childhood defect block strategi
Topic #3: employ enterpris tánaist trade 19 covid ask support due addit
Topic #4: refuge wage salari allow moira particular fire lesbo greec camp
Topic #5: committe pub met engag within view held organis meet sick
Topic #6: initi appoint media union member rebat statutori 2013 abolish futur
Topic #7: statu overse box dundalk dkit bid oversight technolog programm group
Topic #8: plan improv childcar condit worker take pay step children integr
Topic #9: educ research scienc higher innov student third level ask colleg
Topic #10: children equal disabl integr provid ask fund depart provis 2021
Topic #11: white paper ensur campus contact archiv truth schedul either action
Topic #12: tim refuge part ppe plan sinc request year august sean
Topic #13: atten

In [74]:
for a in range(n_topics):
    display(questions[pd.Series([np.argmax(a) for a in lda.transform(transformed)])==a][:3])

Unnamed: 0,date,showAs,by
1,2020-10-20,ask children disabl equal integr carri review fund model famili resourc centr commun servic remit tusla,Violet-Anne Wynne
56,2020-10-20,ask higher educ research innov scienc way blend learn proceed current academ year variou subject area,Neale Richmond
82,2020-10-20,ask tánaist enterpris trade employ estim cost 2021 recruit seven addit heo inspector 35 eo inspector workplac relat commiss,Catherine Murphy


Unnamed: 0,date,showAs,by
7,2020-10-20,ask children disabl equal integr view whether polici chang requir interest safeguard minor case tusla notifi deport order issu minor,Mick Barry
13,2020-10-20,ask children disabl equal integr expand provis budget 2021 give addit €5 million youth servic address exist fund anomali relat exist fund stream address fund inequ counti affect,Jennifer Whitmore
50,2020-10-20,ask higher educ research innov scienc discuss health chief medic offic august septemb 2020 decis made third level colleg would reopen campu full risk analysi carri risk involv given reopen would bring larg number student togeth mani part countri thu increas dramat risk spread nationwid student becam infect,Éamon Ó Cuív


Unnamed: 0,date,showAs,by
11,2020-10-20,ask children disabl equal integr statu develop career pathway earli childhood educ,Alan Farrell
53,2020-10-20,ask higher educ research innov scienc statu posit regard ten-year adult literaci numeraci digit literaci strategi promis 2020 programm govern detail group task develop strategi,Catherine Connolly
60,2020-10-20,ask higher educ research innov scienc establish scienc technolog polici fellowship within civil servic provid opportun scientist engin learn first-hand polici make contribut knowledg analyt skill polici develop,Denis Naughten


Unnamed: 0,date,showAs,by
24,2020-10-20,ask children disabl equal integr letter issu pobal provid relat refund reason offici seek refund provid whose busi risk closur due histor inadequ fund,Kathleen Funchion
31,2020-10-20,ask children disabl equal integr measur put place support children statist vulner group reduc uptak addict habit smoke drink,Alan Farrell
61,2020-10-20,ask higher educ research innov scienc safeti measur place student either erasmu programm current academ year consid embark programm 2021-22 academ year,Neale Richmond


Unnamed: 0,date,showAs,by
12,2020-10-20,ask children disabl equal integr plan assist refuge particular children affect fire moira refuge camp lesbo greec,Pádraig O'Sullivan


Unnamed: 0,date,showAs,by
9,2020-10-20,sean ask children disabl equal integr view introduc sick pay earli year educ worker held meet trade union member organis relat brief,Seán Sherlock
72,2020-10-20,ask taoiseach detail cabinet committe membership committe committe met date committe met,Noel Grealish
77,2020-10-20,ask tánaist enterpris trade employ financi support avail person pub owner sole employe pub view increas level three covid-19 restrict,Jackie Cahill


Unnamed: 0,date,showAs,by
49,2020-10-20,ask higher educ research innov scienc parliamentari question 159 10 septemb 2020 plan develop expans local train initi,David Stanton
64,2020-10-20,ask higher educ research innov scienc new initi educ research depart plan 2021 design acceler capac deliv infrastructur practic underpin ambit reduct emiss damag climat,Richard Bruton
74,2020-10-20,ask taoiseach member union detail suppli appoint commiss futur media,Cathal Crowe


Unnamed: 0,date,showAs,by
17,2020-10-20,ask children disabl equal integr statu oversight group overse develop babi box programm,Neale Richmond
44,2020-10-20,ask higher educ research innov scienc statu technolog univers statu bid dundalk dkit,Ruairí Ó Murchú
98,2020-10-15,ask tánaist enterpris trade employ plan allow taxi driver self-employ person access support scheme similar sme current exclud,Paul McAuliffe


Unnamed: 0,date,showAs,by
3,2020-10-20,ask children disabl equal integr step plan take improv pay condit childcar worker,Denis Naughten
14,2020-10-20,ask children disabl equal integr plan improv pay condit childcar worker,Paul McAuliffe
19,2020-10-20,ask children disabl equal integr timelin place implement workplac relat commiss agreement august 2019 siptu citi counti childcar committe fund provid budget 2021,Christopher O'Sullivan


Unnamed: 0,date,showAs,by
34,2020-10-20,ask children disabl equal integr still accept sixth recommend person detail suppli provid suitabl memori victim survivor magdalen laundri follow state apolog made 2013,Gary Gannon
38,2020-10-20,ask higher educ research innov scienc discuss taken place privat purpose-built student accommod provid relat provid flexibl arrang 2020-21 academ year,Pádraig O'Sullivan
39,2020-10-20,ask higher educ research innov scienc measur take increas number place third level account demograph chang ensur cao 2021 intak advers impact issu regard intak 2020 ensur third-level place want,Mick Barry


Unnamed: 0,date,showAs,by
0,2020-10-20,ask children disabl equal integr detail fund provid crèche condit increas price way monitor,Emer Higgins
5,2020-10-20,ask children disabl equal integr statu commiss investig mother babi home certain relat matter record anoth bill 2020,Ruairí Ó Murchú
6,2020-10-20,ask children disabl equal integr expand provis fund asylum seeker way fund address accommod issu direct provis system,Jennifer Whitmore


Unnamed: 0,date,showAs,by
4,2020-10-20,ask children disabl equal integr posit action either taken depart schedul futur archiv infrastructur ensur truth mother babi home,Gary Gannon
35,2020-10-20,ask children disabl equal integr timefram public white paper replac direct provis system detail regard fund set asid budget 2021 ensur implement white paper,Pauline Tully


Unnamed: 0,date,showAs,by
20,2020-10-20,ask children disabl equal integr statu progress fulfil plan expand capac care unaccompani refuge children timelin plan accept children moria refuge camp,Catherine Connolly
26,2020-10-20,sean ask children disabl equal integr request ppe made childcar earli year educ provid sinc august 2020 provid breakdown,Seán Sherlock
37,2020-10-20,ask higher educ research innov scienc consid remov rule within susi allow student reclassifi point entri unless take three-year break educ unfairli target independ student,Gary Gannon


Unnamed: 0,date,showAs,by
10,2020-10-20,ask children disabl equal integr attent drawn issu surround regist chick number also impact quick turnaround nc applic,Kathleen Funchion
28,2020-10-20,ask children disabl equal integr posit regard statu widespread test plan resid direct provis,Neale Richmond
30,2020-10-20,ask children disabl equal integr process develop youth servic area grow popul rathcool counti dublin detail suppli,Mark Ward


Unnamed: 0,date,showAs,by
15,2020-10-20,ask children disabl equal integr step take ensur travel commun enjoy real equal societi programm depart place promot equal commun,Éamon Ó Cuív
78,2020-10-20,ask tánaist enterpris trade employ receiv submiss bodi detail suppli seek relax deadlin complianc rule due except pressur creat covid-19,Richard Bruton
91,2020-10-20,ask higher educ research innov scienc request third level univers reduc fee student complet master 's degre view fact on-campu teach,Seán Haughey


Unnamed: 0,date,showAs,by
2,2020-10-20,ask children disabl equal integr date tusla review provis safe emerg accommod victim domest violenc publish analysi carri depart and-or tusla impact covid-19 pandem number emerg accommod space avail victim domest violenc,Catherine Connolly
22,2020-10-20,ask children disabl equal integr plan place assist refuge particular children affect fire moria refuge camp lesbo greec,Christopher O'Sullivan
47,2020-10-20,ask higher educ research innov scienc discuss third level institut relat refund rent student rent on-campu accommod find cours deliv onlin forese futur due covid-19 crisi accommod need,Éamon Ó Cuív


Unnamed: 0,date,showAs,by
27,2020-10-20,ask children disabl equal integr reason earli childcar provid lost immedi famili even covid-19 take day funer told depart day,Peadar Tóibín
32,2020-10-20,sean ask children disabl equal integr taken formal respons disabl matter,Seán Sherlock
96,2020-10-15,ask tánaist enterpris trade employ number valu loan approv date covid-19 credit guarante scheme,Pearse Doherty


Unnamed: 0,date,showAs,by
40,2020-10-20,ask higher educ research innov scienc view whether signific increas fund requir ensur susi grant cover genuin cost relat access third level train address part review susi grant scheme,Mick Barry
42,2020-10-20,ask higher educ research innov scienc consid chang criteria calcul susi grant elig includ whether rent receipt work famili yment,Pa Daly


Unnamed: 0,date,showAs,by
8,2020-10-20,ask children disabl equal integr reason order qualifi nation childcar scheme nc parent need public servic card given depart done away requir public servic card access servic name passport servic,Kathleen Funchion
33,2020-10-20,ask children disabl equal integr attent drawn fact seriou ongo issu term servic pobal provid childcar provid detail suppli,Kathleen Funchion
65,2020-10-20,ask higher educ research innov scienc engag colleg third level institut ensur recent alloc mental health fund benefit student need assist view fact studi remot could base anywher countrywid way student identifi colleg assist way fund measur depart ensur student need assist benefit fund,Aindrias Moynihan


Unnamed: 0,date,showAs,by
21,2020-10-20,ask children disabl equal integr expand provis contain budget 2021 intend reduc child poverti ireland,Jennifer Whitmore
46,2020-10-20,ask higher educ research innov scienc manner hardship fund €50 million announc depart budget day roll made avail student,Peadar Tóibín
54,2020-10-20,ask higher educ research innov scienc rational abolish fee third level postgradu sector budget 2021,Richard Boyd Barrett
