Home
tested 2014-15. fixed parser eidikothta for oromisthioi_defterovathmias (filename: perioxes)
tested 2013-14. Parser parse_link downloads the html table in the eniaioidior case
testing 2012-13. Parser downloads htmls. Moved data to an external location. To continue from eniaiosd_2012, exclude the previous entries.
parser now ignores html tables (due to their size!) and downloads only the gz files. (tables in e-aitisi exist in both formats, or only as gz, or only as xls(x))
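A minimal sketch of that gz path, fetching a table and decompressing it in memory (the URL here is illustrative, not a real table):

```python
import gzip
import requests

# fetch a gzipped table and decompress it in memory; the URL is illustrative
resp = requests.get('http://e-aitisi.sch.gr/.../pinakas.gz')
html = gzip.decompress(resp.content)
```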
hmeromhnia fix: handle dates in the alternative yyyymmdd(a/b) format (2015-16), or at a different position in the path
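A minimal sketch of such a date parser, assuming path tokens shaped like 20150910 or 20150910a; the helper name is made up:

```python
import re
from datetime import date

def parse_hmeromhnia(token):
    """Parse a path token like '20150910' or '20150910a' (hypothetical helper)."""
    m = re.fullmatch(r'(\d{4})(\d{2})(\d{2})([a-z]?)', token)
    if not m:
        return None
    y, mo, d, suffix = m.groups()
    # the optional trailing letter distinguishes tables posted on the same day
    return date(int(y), int(mo), int(d)), suffix or None

print(parse_hmeromhnia('20150910a'))  # (datetime.date(2015, 9, 10), 'a')
```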
tested and downloaded 2016-17 and 2015-16 so far
parser.py downloads and DB ok (sqlalchemy, now). tested 2016 only
added Hmeromhnia.real_hmeromhnia, Pinakas.url_pinaka
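A hedged sketch of what those two models might look like; only the class names and the two new columns come from this log, every other column is an assumption:

```python
from sqlalchemy import Column, Date, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Hmeromhnia(Base):
    __tablename__ = 'hmeromhnia'
    id = Column(Integer, primary_key=True)
    real_hmeromhnia = Column(Date)   # the actual date parsed from the path

class Pinakas(Base):
    __tablename__ = 'pinakas'
    id = Column(Integer, primary_key=True)
    url_pinaka = Column(String)      # source URL of the table
    hmeromhnia_id = Column(Integer, ForeignKey('hmeromhnia.id'))
    hmeromhnia = relationship('Hmeromhnia')
```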
TODO eidikothtes ignore kataggelia etc
TODO script timing??
Polished db_init
Using sqlite for inserting into the DB from Parser. TODO: check this tutorial again for sqlalchemy: http://pythoncentral.io/introductory-tutorial-python-sqlalchemy/
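A minimal sketch of that insert path, reusing the Base/Pinakas models sketched above; the database file name is illustrative:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# assumes the Base / Pinakas models sketched above; the file name is illustrative
engine = create_engine('sqlite:///e_aitisi.db')
Base.metadata.create_all(engine)          # create the tables on first run
Session = sessionmaker(bind=engine)
session = Session()
session.add(Pinakas(url_pinaka='http://e-aitisi.sch.gr/'))
session.commit()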
Filling in tables EXCEPT Pinakas
Parser starts downloading -- tested only 2016 (not all..)
new methods: download_table, find_kathgoria (mapped all categories from excel)
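A hedged sketch of what find_kathgoria might do; the mapping entries here are illustrative stand-ins for the ones taken from categories.xls:

```python
# illustrative category mapping; the real one is built from categories.xls
KATHGORIES = {
    'kataggelia': 'kataggelia',
    'eniaiosp': 'eniaios (primary)',
    'eniaiosd': 'eniaios (secondary)',
    'mousika': 'music schools',
}

def find_kathgoria(name):
    """Map a table file/folder name to its category (sketch, not the real method)."""
    for prefix, kathgoria in KATHGORIES.items():
        if name.startswith(prefix):
            return kathgoria
    return 'unknown'

print(find_kathgoria('eniaiosd2016'))  # eniaios (secondary)
```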
no need for other filetypes, but just in case:

```python
filetypes = ['.xls', '.xlsx', '.csv']
if any(url.endswith(x) for x in filetypes):
    ...
```
Installed sqlalchemy
checked out (in the colloquial, non-git sense) pmav's foivos.py --> db
tutorial: http://pythoncentral.io/introductory-tutorial-python-sqlalchemy/ (ok)
Updated README.md
tutorials: http://docs.sqlalchemy.org/en/latest/orm/backref.html (ok)
http://docs.sqlalchemy.org/en/latest/orm/session.html
Parser class
deleted branches: globalsrid and modules
suffix var in crawler is no longer global; the rest (log, links, tables) would be too much for now
fixed issues 6 (remove project info from history) and 8 (split commit)
Added a few usage checks
started Wiki
TODO first 6 commits obsolete. Done.
Rebase Hell.. hope ok now
globals no more
rearranged git tracking
deleted/untracked data and polished remote history
.gitignore
https://rtyley.github.io/bfg-repo-cleaner/
https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History
pmav99
categories.xls
http://e-aitisi.sch.gr/triantamino_07/index.html --> 2016 index !
modules using globalvars.py
comparing logs...OK
2010 OK
http://e-aitisi.sch.gr/eniaios_smea_orom_11_B/index.html
= http://e-aitisi.sch.gr/index2013.html
2011 OK
2012 OK
2013,14,15 OK
2016 OK
VERSION 1: build the links list (created before the 'for' loop) & dump to json
```python
import json

# inside the crawl loop: tag, url and links come from the surrounding crawler code
href = tag.get('href')
# create a dict for this link
link = {
    'link_url': url + '/' + href,
    'text': tag.string,
}
# add the link to the links list
links.append(link)
print(links)
print(json.dumps(links, sort_keys=True, indent=4))

# wrap the list and write it to links.json
d = {"name": "links",
     "children": links}
j = json.dumps(d, sort_keys=True, indent=4)
with open('links.json', 'w') as fp:
    fp.write(j)   # j is already a serialized string, so write it directly
```
DOWNLOAD XLS
```python
import os
import requests

# download the spreadsheet unless it is already on disk
if any(url.endswith(x) for x in filetypes):
    print('--xcel:', url, tag.contents)
    filename = url.rsplit('/')[-1]
    if not os.path.isfile(filename):
        response = requests.get(url)
        with open(filename, 'wb') as output:
            output.write(response.content)
        print('Downloaded')
    else:
        print('Already there')
```
crawler checked from 2003-2004 up to 2009-2010
DONE run arg from command line
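A minimal sketch of that command-line argument, assuming the school year is passed as the first positional arg (the real CLI of crawler.py may differ):

```python
import sys

# take the school year to crawl from the command line, e.g.:
#   python crawler.py 2009-2010
year = sys.argv[1] if len(sys.argv) > 1 else '2016-2017'
print('crawling', year)
```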
TODO text in soup OK
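For reference, a tiny example of pulling text out of soup with BeautifulSoup, along the lines the crawler uses (the markup here is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="pinakas.xls">Πίνακας</a>', 'html.parser')
for tag in soup.find_all('a'):
    # href plus the visible link text, as stored in the crawler's link dicts
    print(tag.get('href'), tag.get_text(strip=True))
```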
crawler still missing
zero-proypiresia (prior-service) tables with the same name
tables in html
maybe run the script separately for each year? it would save time in the end
if automated:
a new folder would have to be created (manual?) pff
---- ENCODING SCHEME ----
date: yyyymmddl (l = a,b,c... each used once)
main_folder/category(/date)/(category/date/)filename.(xls/html) (see the path-splitting sketch after this list)
main_folder = 2016-2017, 2015-2016 ..
categories:
- kataggelia_16, 15, 14
- diagrafentes_16
- eniaios(p/d)(zero)2016 .. 2007 .. 2005
- mousika_(orom_)2016 .. 2007 .. 2005
- pe73_2016 (Minority Program of the Minority Schools of Primary Education in Thrace)
- avlonas_2016
- eniaios_smea_anap_16, 15, 14
2013-2014
- eniaoidior_13 (APPOINTMENT TABLES)
- eniaios40_2013
- eikositetramino2013
- triantamino_2013
- orosmisthioi_2013 ... 2009, 2008, 2007
- eniaios_smea_anap_13_(A/B), 12, 11
- /date/ENIAIOS_SMEA_ANAP_13_(normal/braille/noimatiki/brailleNoimatiki)_(A/B)
2012-2013
- dates exist up to this point
- from here on backwards:
- specialcat_2012, 11
- eniaios_diorismwn_12, 11, 10, 09, 08, 07, 06, 05
- triantamino_12, 11, 10, 09, 08, 07, 06
- eikositetramino_12, 11, 10, 09, 08
- tadmon_2012, 11
2010
- specialcat_2010/Phasi(A/B/BAthmia)
- oloimera_2010, 2009, 2008, 2007, 2005
- eniaios_smea_anap_10_(braille/knowlang/braille_and_knowlang_)(A/B), 09
- eniaios_smea_(orom/oloim)_10_(A/B), 09
2009
- politeknoi2009
- tad(anap/orom)_2009
2008
- eniaios_smea_anap_08
- (normal/Braille/Noimatiki)_(A/B)
- smea_oromisthioi_2008
- (normal/Braille/Noimatiki)_(A/B)
2007
- no SMEA
2006
- mousika_2006/em16.html (Empeirotechnes, i.e. empirically trained)
2005
- eniaios_oromis8iwn_05(_zero)
2004
- eniaios(p/d)_2004 only
- pinakes(AB/C)(/PLIROFORIKH_D-E.html)
2003
- eniaios_(a/b)thmias_2003
- pinakes(AB/G)/html(/PLIROFORIKH_D-E.html)
----
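A minimal sketch of splitting a stored path according to the scheme above; the helper name and the example path are made up from the categories listed:

```python
import re

def split_path(path):
    """Split main_folder/category(/date)/filename per the scheme above (hypothetical helper)."""
    parts = path.strip('/').split('/')
    main_folder, category, filename = parts[0], parts[1], parts[-1]
    date = None
    for p in parts[2:-1]:
        if re.fullmatch(r'\d{8}[a-z]?', p):   # yyyymmdd plus the optional letter
            date = p
    return main_folder, category, date, filename

print(split_path('2015-2016/kataggelia_15/20150910a/perioxes.xls'))
# ('2015-2016', 'kataggelia_15', '20150910a', 'perioxes.xls')
```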
Started crawler.py. MOVED to the main folder
OK for xls
html still needs work
xls to sqlite???
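One possible answer, as a hedged sketch: pandas can read the sheet and push it straight into SQLite (file and table names are illustrative):

```python
import sqlite3
import pandas as pd

df = pd.read_excel('perioxes.xls')            # legacy .xls files need the xlrd package
with sqlite3.connect('e_aitisi.db') as conn:
    df.to_sql('perioxes', conn, if_exists='replace', index=False)
```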
xls with requests:
```python
import requests
import http.client

# downgrade to HTTP/1.0 (presumably to work around server quirks)
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

url = "http://...xls"   # truncated in the original notes
resp = requests.get(url)
with open('test.xls', 'wb') as output:
    output.write(resp.content)
```
DONE -> 17/10/2016
played around with requests to do the same things as with urllib
project-oriented: either load the tables manually into my own database, or crawl every time
ok requests to get images from first page of site: requests2img
should it become a crawler? anyway
requests docs!
finished chapter 12 of Python for All, about networking
remembered urllib
re
beautifulsoup
read the Beautiful Soup documentation
myurllink2img.py:
tried downloading images from in.gr
rejected with error 503: no automated access
requests library better!
<-- time for requests