This is a scraping script for extracting the results of the Romanian Baccalaureate from http://bacalaureat.edu.ro for the years 2006 - 2012.
- python 2.7
- python-lxml
- PylibLZMA (for LZMA/XZ compressed files)
- python-argparse (on Python 2.6)
Install the dependencies with yum:
yum install python-lxml pyliblzma
On Python 2.6, also install python-argparse:
yum install python-lxml pyliblzma python-argparse
First, get the HTML pages that will be parsed. You can download them with a browser or use a spider, whichever suits you. Then parse them:
./main.py data/alfabetic_page_4.html
outputs:
Elev(nume=u'John Doe', scoala=u'Summer school', judet=u'B', promotie_anterioara=u'NU', forma_invatamant=u'Zi', specializare=u'Tehnician in activitati economice', d_romana_competente=u'Utilizator avansat', d_romana_scris_nota=u'5.45', d_romana_scris_nota_contestatie=u'', d_romana_scris_nota_finala=u'5.45', d_limba_materna_nume=u'', d_limba_materna_competente=u'', d_limba_materna_scris_nota=u'', d_limba_materna_scris_nota_contestatie=u'', d_limba_materna_scris_nota_finala=u'', d_limba_moderna_nume=u'Limba engleza', d_limba_moderna_nota=u'B2-A2-B2-B1-B1', d_profil_scris_nume=u'Matematica T2', d_profil_scris_nota=u'5', d_profil_scris_nota_contestatie=u'', d_profil_scris_nota_finala=u'5', d_alegere_scris_nume=u'Economie', d_alegere_scris_nota=u'8.05', d_alegere_scris_nota_contestatie=u'', d_alegere_scris_nota_finala=u'8.05', d_competente_digitale=u'Utilizator experimentat', rezultat_final=u'Reu\u015fit')
#######################################################################
...
#######################################################################
Elev(nume=u'Joahna Doe', scoala=u'Winter school', judet=u'B', promotie_anterioara=u'NU', forma_invatamant=u'Zi', specializare=u'Tehnician in activitati economice', d_romana_competente=u'Utilizator avansat', d_romana_scris_nota=u'5.45', d_romana_scris_nota_contestatie=u'', d_romana_scris_nota_finala=u'5.45', d_limba_materna_nume=u'', d_limba_materna_competente=u'', d_limba_materna_scris_nota=u'', d_limba_materna_scris_nota_contestatie=u'', d_limba_materna_scris_nota_finala=u'', d_limba_moderna_nume=u'Limba engleza', d_limba_moderna_nota=u'B2-A2-B2-B1-B1', d_profil_scris_nume=u'Matematica T2', d_profil_scris_nota=u'5', d_profil_scris_nota_contestatie=u'', d_profil_scris_nota_finala=u'5', d_alegere_scris_nume=u'Economie', d_alegere_scris_nota=u'8.05', d_alegere_scris_nota_contestatie=u'', d_alegere_scris_nota_finala=u'8.05', d_competente_digitale=u'Utilizator experimentat', rezultat_final=u'Reu\u015fit')
You can parse multiple files in one run. In order to save disk space, reading from compressed files is supported (formats: gzip, bzip2, lzma/xz).
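As a rough illustration of how transparent decompression like this can work, here is a minimal sketch that picks an opener from the file extension. The function name `open_page` and the suffix table are hypothetical and are not taken from main.py; the script itself may detect formats differently. The `lzma` module is in the standard library on Python 3.3+, while on Python 2 it comes from PylibLZMA.

```python
import bz2
import gzip
import lzma  # stdlib on Python 3.3+; provided by PylibLZMA on Python 2

# Hypothetical mapping from file suffix to a file-like opener.
_OPENERS = {
    ".gz": gzip.GzipFile,
    ".bz2": bz2.BZ2File,
    ".xz": lzma.LZMAFile,
    ".lzma": lzma.LZMAFile,
}


def open_page(path):
    """Open a page for binary reading, decompressing based on its suffix."""
    for suffix, opener in _OPENERS.items():
        if path.endswith(suffix):
            # All three classes default to read mode.
            return opener(path)
    # No known compression suffix: treat it as a plain file.
    return open(path, "rb")
```

With a helper like this, `open_page("data/alfabetic_page_4.html.xz").read()` would return the decompressed HTML bytes, and uncompressed pages would be read unchanged.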
You can specify the year of the exam with the --year parameter. By default, it is the last supported year.
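For readers unfamiliar with argparse, the following sketch shows one way an option like --year could default to the last supported year. It is not the code from main.py: `SUPPORTED_YEARS` and `build_parser` are hypothetical names, and the year range is simply the 2006 - 2012 range stated at the top of this README.

```python
import argparse

# Assumption: the supported range is the one stated in this README.
SUPPORTED_YEARS = list(range(2006, 2013))


def build_parser():
    """Build a parser with a --year option defaulting to the latest year."""
    parser = argparse.ArgumentParser(
        description="Parse Baccalaureate result pages.")
    parser.add_argument("files", nargs="+",
                        help="HTML pages, optionally compressed")
    parser.add_argument("--year", type=int, choices=SUPPORTED_YEARS,
                        default=SUPPORTED_YEARS[-1],
                        help="exam year (default: last supported year)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.year, args.files)
```

Using `choices` makes argparse reject an unsupported year with a usage error instead of letting the parser fail later on an unexpected page layout.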
If you're planning to analyze the results, you can dump them in the
pickle format. Then either read the pickle file from Python or convert it
to a CSV file using pickle2csv.py. From there on, the sky's the limit.
Here's an example:
./main.py --format pickle data/alfabetic_page_4.html.xz | xz --best > results.pickle.xz
./pickle2csv.py results.pickle.xz > results.csv
The pickle file is actually composed of multiple pickle dumps in order to
minimize memory usage, so you'll need to load pickles from it until EOF.
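Reading a file made of several back-to-back pickle dumps can be sketched as follows. The helper name `iter_pickles` is hypothetical; the only behaviour relied on is that `pickle.load` raises `EOFError` once the stream is exhausted. What each dump contains (a single record or a batch of records) is not specified here, so the sketch just yields whatever object each dump holds.

```python
import pickle


def iter_pickles(stream):
    """Yield every object from a binary stream of consecutive pickle dumps."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            # No more dumps left in the stream.
            return
```

For example, `for obj in iter_pickles(open("results.pickle", "rb")): ...` processes one dump at a time without holding the whole file in memory. For a compressed file such as results.pickle.xz, pass a decompressing file object instead. Unpickling the real results may also require the module that defines the record type to be importable.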
As you can see, pickle2csv supports compressed files, too.
The --year parameter works in the same way as for main.
If you don't need the pickle files, you can do it in a single step:
./main.py --format pickle data/alfabetic_page_4.html.xz | ./pickle2csv.py - > results.csv
Here's a list of software you might find useful for downloading the pages:
- bac-spider - the spider I used to get the results for years 2006 - 2011
- bac-spider-2012 - a spider for getting the results for year 2012
The code is too simple and too ugly to require legal paperwork, so I declare it public domain.
This wouldn't have been possible without the Sothink SWF Decompiler. Shame on Siveco for using Flash even if it wasn't really needed.