
7/11/2016

tested 2014-15. Fixed the parser's eidikothta for oromisthioi_defterovathmias (filename: perioxes)
tested 2013-14. parse_link now downloads the html table if eniaioidior
testing 2012-13. Parser downloads the htmls. Moved the data to an external location. To continue from eniaiosd_2012, exclude the previous categories (see sketch below).
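A possible shape for that resume logic (hypothetical sketch: `category_links` and `download_category` are illustrative names, not the parser's real ones):

    from itertools import dropwhile

    # category_links: iterable of category link urls (placeholder)
    # drop every link before eniaiosd_2012,
    # keep eniaiosd_2012 itself and everything after it
    remaining = dropwhile(lambda link: 'eniaiosd_2012' not in link,
                          category_links)
    for link in remaining:
        download_category(link)   # placeholder for the real download step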

6/11/2016

parser now ignores html tables (due to size!) and downloads only the gz versions (tables on e-aitisi exist in both formats, or only as gz, or only as xls(x))
hmeromhnia fix: the date may come in another format, yyyymmdd(a/b) (2015-16), or from its position in the path (sketch below)
tested and downloaded 2016-17 and 2015-16 so far
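The hmeromhnia handling in outline (a minimal sketch under my reading of the two cases; `find_hmeromhnia` is an illustrative name, not necessarily the parser's):

    import re

    # yyyymmdd optionally followed by a disambiguating letter (a/b/...)
    DATE_RE = re.compile(r'^(\d{8})([a-z]?)$')

    def find_hmeromhnia(path):
        # e.g. '2015-2016/eniaiosp2015/20150901a/pinakas.xls.gz'
        for part in path.split('/'):
            m = DATE_RE.match(part)
            if m:
                return m.group(1), m.group(2)   # ('20150901', 'a')
        return None   # no date component in this path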

4/11/2016

parser.py downloads and fills the DB OK (with sqlalchemy now). Tested 2016 only
added Hmeromhnia.real_hmeromhnia and Pinakas.url_pinaka (sketched below)
TODO: eidikothtes should ignore kataggelia etc.
TODO: script timing??
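The two new columns, roughly as they sit in the declarative models (a minimal sketch assuming plain table names; not the repo's exact schema):

    from sqlalchemy import Column, Date, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Hmeromhnia(Base):
        __tablename__ = 'hmeromhnia'
        id = Column(Integer, primary_key=True)
        real_hmeromhnia = Column(Date)    # actual date parsed from the path

    class Pinakas(Base):
        __tablename__ = 'pinakas'
        id = Column(Integer, primary_key=True)
        url_pinaka = Column(String)       # source url of the table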

3/11/2016

Polished db_init
Using sqlite for inserting into the DB from the Parser (sketch below). TODO: check this tutorial again for sqlalchemy: http://pythoncentral.io/introductory-tutorial-python-sqlalchemy/
Filling in all tables EXCEPT Pinakas
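Inserting from the Parser then follows the tutorial's session pattern (sketch reusing the models sketched above; the sqlite filename is a placeholder):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine('sqlite:///e_aitisi.db')   # placeholder db name
    Base.metadata.create_all(engine)                  # Base from the models above

    Session = sessionmaker(bind=engine)
    session = Session()
    session.add(Pinakas(url_pinaka='http://e-aitisi.sch.gr/...'))
    session.commit()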

2/11/2016

Parser starts downloading -- tested only on 2016 (and not all of it...)

  • new methods: download_table, find_kathgoria (mapped all the categories from the excel file); see the sketch below

  • no need for other filetypes, but just in case:

      filetypes = ['.xls', '.xlsx', '.csv']

      if any(url.endswith(x) for x in filetypes):

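A rough shape for download_table (illustrative sketch only; argument names and folder handling are assumptions, see also the encoding notes further down):

    import os
    import requests

    def download_table(url, dest_dir):
        # save one table file under its category folder
        filename = url.rsplit('/', 1)[-1]
        target = os.path.join(dest_dir, filename)
        if os.path.isfile(target):
            return                      # already downloaded
        os.makedirs(dest_dir, exist_ok=True)
        response = requests.get(url)
        with open(target, 'wb') as output:
            output.write(response.content)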
30/10/2016

Installed sqlalchemy
checked out (in the everyday sense, not the git one) pmav's foivos.py --> db
tutorial: http://pythoncentral.io/introductory-tutorial-python-sqlalchemy/ (ok)
Updated README.md
tutorials: http://docs.sqlalchemy.org/en/latest/orm/backref.html (ok)
http://docs.sqlalchemy.org/en/latest/orm/session.html

28/10/2016

Parser class
deleted branches: globalsrid and modules

27/10/2016

suffix var in crawler no longer global; the rest (log, links, tables) are too much for now
fixed issues 6 (remove project info from history) and 8 (split commit)
Added a few usage checks
started Wiki
TODO first 6 commits obsolete -- Done

Rebase hell... hope it's ok now

26/10/2016

globals no more
rearranged git tracking

25/10/2016

deleted/untracked data and polished remote history
.gitignore
https://rtyley.github.io/bfg-repo-cleaner/
https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History

24/10/2016

pmav99

23/10/2016

categories.xls
http://e-aitisi.sch.gr/triantamino_07/index.html --> 2016 index !

22/10/2016

modules using globalvars.py
comparing logs...OK

21/10/2016

2010 OK

http://e-aitisi.sch.gr/eniaios_smea_orom_11_B/index.html
= http://e-aitisi.sch.gr/index2013.html

2011 OK
2012 OK
2013,14,15 OK
2016 OK

20/10/2016

VERSION 1: build the links list (created before the 'for' loop) & dump to json

    import json

    # assumes soup (BeautifulSoup of the index page) and url are defined
    links = []  # created before the 'for' loop

    for tag in soup.find_all('a'):
        href = tag.get('href')

        # create dict for link
        link = {
            'link_url': url + '/' + href,
            'text': tag.string,
        }

        # add link to links list
        links.append(link)

    print(links)
    print(json.dumps(links, sort_keys=True, indent=4))

    d = {"name": "links",
         "children": links}
    # dump the dict itself; passing a json.dumps() string to json.dump()
    # would double-encode it
    with open('links.json', 'w') as fp:
        json.dump(d, fp, sort_keys=True, indent=4)

DOWNLOAD XLS

    import os
    import requests

    filetypes = ['.xls', '.xlsx', '.csv']

    if any(url.endswith(x) for x in filetypes):
        print('--xcel:', url, tag.contents)

        filename = url.rsplit('/')[-1]
        if not os.path.isfile(filename):
            response = requests.get(url)
            with open(filename, 'wb') as output:
                output.write(response.content)
            print('Downloaded')
        else:
            print('Already there')

crawler checked from 2003-2004 up to 2009-2010
DONE: run arg from command line (sketch below)
TODO text in soup -- OK
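The command-line run argument in its simplest form (sketch; `crawl` stands in for whatever the crawler's entry point is actually called):

    import sys

    # e.g.  python crawler.py 2009-2010
    year = sys.argv[1] if len(sys.argv) > 1 else '2016-2017'
    crawl(year)   # placeholder entry point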

19/10/2016

crawler still missing:
zero-Proypiresias tables with the same name
tables in html

maybe the script should run separately for each year? it saves time in the end

if automated:
a new folder has to be created (manually?) pfff

---- ENCODING SCHEME ----

date: yyyymmddl (l = a, b, c... appears once)
main_folder/category(/date)/(category/date/)filename.(xls/html)
main_folder = 2016-2017, 2015-2016 ..
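Splitting a path under this scheme might look like (hypothetical helper; only the layout above is assumed):

    import re

    DATE_PART = re.compile(r'^\d{8}[a-z]?$')   # yyyymmddl

    def split_path(path):
        # e.g. '2016-2017/eniaiosp2016/20160901a/pinakas.xls'
        parts = path.split('/')
        main_folder, filename = parts[0], parts[-1]
        middle = parts[1:-1]
        dates = [p for p in middle if DATE_PART.match(p)]
        categories = [p for p in middle if not DATE_PART.match(p)]
        return main_folder, categories, dates, filename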

categories:

  • kataggelia_16, 15, 14
  • diagrafentes_16
  • eniaios(p/d)(zero)2016 .. 2007 .. 2005
  • mousika_(orom_)2016 .. 2007 .. 2005
  • pe73_2016 (Minority Programme of Minority Schools of Primary Education of Thrace)
  • avlonas_2016
  • eniaios_smea_anap_16, 15, 14

2013-2014

  • eniaoidior_13 (APPOINTMENT TABLES)
  • eniaios40_2013
  • eikositetramino2013
  • triantamino_2013
  • orosmisthioi_2013 ... 2009, 2008, 2007
  • eniaios_smea_anap_13_(A/B), 12, 11
    • /date/ENIAIOS_SMEA_ANAP_13_(normal/braille/noimatiki/brailleNoimatiki)_(A/B)

2012-2013

  • dates exist up to this point
  • from here backwards:
  • specialcat_2012, 11
  • eniaios_diorismwn_12, 11, 10, 09, 08, 07, 06, 05
  • triantamino_12, 11, 10, 09, 08, 07, 06
  • eikositetramino_12, 11, 10, 09, 08
  • tadmon_2012, 11

2010

  • specialcat_2010/Phasi(A/B/BAthmia)
  • oloimera_2010, 2009, 2008, 2007, 2005    
    
  • eniaios_smea_anap_10_(braille/knowlang/braille_and_knowlang_)(A/B), 09    
    
  • eniaios_smea_(orom/oloim)_10_(A/B), 09      
    

2009

  • politeknoi2009
  • tad(anap/orom)_2009

2008

  • eniaios_smea_anap_08
  • (normal/Braille/Noimatiki)_(A/B)
  • smea_oromisthioi_2008    
    
  • (normal/Braille/Noimatiki)_(A/B)

2007

  • no SMEA

2006

  • mousika_2006/em16.html (Empeirotechnon)

2005

  • eniaios_oromis8iwn_05(_zero)

2004

  • eniaios(p/d)_2004 only
  • pinakes(AB/C)(/PLIROFORIKH_D-E.html)

2003

  • eniaios_(a/b)thmias_2003   
    
  • pinakes(AB/G)/html(/PLIROFORIKH_D-E.html)

----

17/10/2016

Started crawler.py; MOVED to main folder

OK for xls
html still needs work

xls to sqlite???

13/10/2016

xls with requests:

    import requests, os
    import http.client

    # force plain HTTP/1.0 for http.client connections
    http.client.HTTPConnection._http_vsn = 10
    http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

    url = "http://...xls"
    resp = requests.get(url)
    with open('test.xls', 'wb') as output:
        output.write(resp.content)

DONE -> 17/10/2016

played with requests, doing the same things as with urllib

project oriented: either put the tables into a database of my own by hand, or crawl every time

ok requests to get images from first page of site: requests2img

turn it into a crawler? anyway

requests docs!

12/10/2016

finished chapter 12 of python for all
about networking
remembered urllib
re
beautifulsoup

read documentation of beautiful soup

myurllink2img.py:
tried downloading images from in.gr
rejected: error 503, no automated access

the requests library is better!

<-- time for requests
