SG: don't fail when unable to OCR #1281

Merged
merged 5 commits into from Apr 3, 2018
Changes from 2 commits
73 changes: 40 additions & 33 deletions parsers/SG.py
@@ -1,11 +1,12 @@
 #!/usr/bin/env python3

 from collections import defaultdict
+import logging
Collaborator

Minor comment: since this is only used once, maybe it should be from logging import getLogger.

Collaborator Author

Hm, this is a style thing (there is no perf difference). My very personal reason for liking logging.getLogger() is that I don't have to scroll up to see whether getLogger was imported, and from which module, or whether it is a function defined somewhere within this file. To give an extreme example, if we were to do from re import search, it would be impossible for someone diving straight into the code to guess what search is without reading all the imports.

I don't think we have particular consistency in the codebase on this right now. At the same time I myself did the from collections import defaultdict right above :D
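
A minimal sketch of why the choice is purely stylistic: both spellings bind the same function object, and the logging module caches loggers by name. The logger name 'parsers.SG' is an assumption made for illustration only.

```python
import logging
from logging import getLogger

# Both import styles refer to the very same function object.
same_function = logging.getLogger is getLogger

# 'parsers.SG' is a hypothetical logger name used only for this sketch.
logger_a = logging.getLogger('parsers.SG')
logger_b = getLogger('parsers.SG')

# Loggers are cached by name, so both calls return one shared instance.
same_logger = logger_a is logger_b
```

What differs is only how much context a reader gets at the call site, which is the point made in the comment above.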

Collaborator Author

Current codebase: import logging 9 times including SG (and in all cases it is only used for getLogger), from logging import getLogger 7 times. So if I change it, it'll be a tie. Do we have a style guide? :D

Collaborator

Oh, I thought there was a difference; never mind then, it doesn't matter. If there was an EM style guide it would simply read "Chaos is good". 😈

Collaborator Author

"Carbon is bad, chaos is good"

+import re

 import arrow
 from PIL import Image
 from pytesseract import image_to_string
-import re
 import requests

 TIMEZONE = 'Asia/Singapore'
@@ -40,18 +41,10 @@

 For Electricity Map, we map CCGT and GT to gas, and ST to "unknown".

-There appears to be no real-time data for solar production.
-Installed solar has been rising rapidly, but from a very low base.
-Per Singapore Energy Statistics 2016 pg 96, total installed solar PV capacity at end of 2015 was 45.8 MWac.
-Per https://www.ema.gov.sg/cmsmedia/Publications_and_Statistics/Statistics/47RSU.pdf
-total installed solar capacity in 2017Q1 was 129.8 MWp / 99.9 MWac, so capacity doubled during 2016.
-However, when producing at max capacity this would only be about 2% of summer mid-night demand of around 5 GW.
-So for now this won't introduce a big inaccuracy.
-
-There exists an interconnection to Malaysia. Its capacity is apparently 2x 200 MW,
-which is potentially 5-10% of normal use. I was unable to find data on its use
-(not even historical, let along real-time). The Singapore Energy Statistics 2016 document
-does not note any electricity exports or imports.
+The Energy Market Authority estimates current solar production and publishes it at
+https://www.ema.gov.sg/solarmap.aspx
+
+There exists an interconnection to Malaysia, it is implemented in MY_WM.py.
 """

 TYPE_MAPPINGS = {
@@ -61,52 +54,61 @@
 }


-def get_solar(session=None):
+def get_solar(session, logger):
     """
     Fetches a graphic showing estimated solar production data.
     Uses OCR (tesseract) to extract MW value.
     Returns a float or None.
     """

-    s = session or requests.Session()
     url = 'https://www.ema.gov.sg/cmsmedia/irradiance/plot.png'
-    solar_image = Image.open(s.get(url, stream=True).raw)
+    solar_image = Image.open(session.get(url, stream=True).raw)

     gray = solar_image.convert('L')
-    threshold_filter = lambda x: 0 if x<77 else 255
+    threshold_filter = lambda x: 0 if x < 77 else 255
     black_white = gray.point(threshold_filter, '1')

     text = image_to_string(black_white, lang='eng')

-    pattern = r'Est. PV Output: (.*)MWac'
-    val = re.search(pattern, text, re.MULTILINE).group(1)
+    try:
Collaborator

Much better handling, we shouldn't throw away data just because solar is not working.

+        pattern = r'Est. PV Output: (.*)MWac'
+        val = re.search(pattern, text, re.MULTILINE).group(1)
+
+        time_pattern = r'\d+-\d+-\d+\s+\d+:\d+'
+        time_string = re.search(time_pattern, text, re.MULTILINE).group(0)
+    except AttributeError:
+        msg = 'Unable to get values for SG solar from OCR text: {}'.format(text)
+        logger.warning(msg, extra={'key': 'SG'})
+        return None

-    time_pattern = r'\d+-\d+-\d+\s+\d+:\d+'
-    time_string = re.search(time_pattern, text, re.MULTILINE).group(0)
     solar_dt = arrow.get(time_string).replace(tzinfo='Asia/Singapore')
     singapore_dt = arrow.now('Asia/Singapore')
     diff = singapore_dt - solar_dt

     # Need to be sure we don't get old data if image stops updating.
     if diff.seconds > 3600:
-        print('Singapore solar data is too old to use.')
+        msg = ('Singapore solar data is too old to use, '
+               'parsed data timestamp was {}.').format(solar_dt)
+        logger.warning(msg, extra={'key': 'SG'})
         return None

     # At night format changes from 0.00 to 0
     # tesseract cannot distinguish singular 0 and O in font provided by image.
     # This try/except will make sure no invalid data is returned.
     try:
         solar = float(val)
-    except ValueError as err:
+    except ValueError:
         if len(val) == 1 and 'O' in val:
             solar = 0.0
         else:
-            print("Singapore solar data is unreadable - got {}.".format(val))
-            solar = None
+            msg = "Singapore solar data is unreadable - got {}.".format(val)
+            logger.warning(msg, extra={'key': 'SG'})
+            return None
     else:
         if solar > 200.0:
-            print("Singapore solar generation is way over capacity - got {}".format(val))
-            solar = None
+            msg = "Solar generation is way over capacity - got {}".format(val)
+            logger.warning(msg, extra={'key': 'SG'})
+            return None

     return solar
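
The parsing steps in this hunk can be tried without tesseract or network access. The following is a standalone sketch, not the merged code: `parse_solar_text` and `is_fresh` are hypothetical helper names, the dot in the pattern is escaped, the captured value is stripped, and explicit `None` checks stand in for catching `AttributeError`. The freshness helper compares whole timedeltas, since `timedelta.seconds` holds only the seconds component of an interval and ignores whole days. It uses the standard `datetime` module rather than arrow to stay dependency-free.

```python
from datetime import datetime, timedelta
import re

VALUE_PATTERN = r'Est\. PV Output: (.*)MWac'
TIME_PATTERN = r'\d+-\d+-\d+\s+\d+:\d+'
MAX_PLAUSIBLE_MW = 200.0  # same sanity bound as in the diff


def parse_solar_text(text):
    """Hypothetical helper: return (megawatts, time_string) or None."""
    value_match = re.search(VALUE_PATTERN, text)
    time_match = re.search(TIME_PATTERN, text)
    if value_match is None or time_match is None:
        # OCR output did not contain the expected labels.
        return None

    val = value_match.group(1).strip()
    try:
        solar = float(val)
    except ValueError:
        # At night the image may show a lone 0, which OCR can read as 'O'.
        if val == 'O':
            solar = 0.0
        else:
            return None

    if solar > MAX_PLAUSIBLE_MW:
        return None
    return solar, time_match.group(0)


def is_fresh(solar_dt, now, max_age=timedelta(hours=1)):
    """Hypothetical helper: True if the reading is at most max_age old."""
    # Comparing timedeltas directly accounts for whole days as well.
    return now - solar_dt <= max_age
```

For example, a reading that is one day and 30 minutes old has `timedelta.seconds == 1800`, so a check written against `.seconds` alone would treat it as recent, while `is_fresh` rejects it.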

@@ -159,7 +161,8 @@ def sg_data_to_datetime(data):
     return data_datetime


-def fetch_production(zone_key='SG', session=None, target_datetime=None, logger=None):
+def fetch_production(zone_key='SG', session=None, target_datetime=None,
+                     logger=logging.getLogger(__name__)):
     """Requests the last known production mix (in MW) of Singapore.

     Arguments:
@@ -202,12 +205,12 @@ def fetch_production(zone_key='SG', session=None, target_datetime=None, logger=N

         else:
             # unrecognized - log it, then add into unknown
-            print(
-                'Singapore has unrecognized generation type "{}" with production share {}%'.format(
-                    gen_type, gen_percent))
+            msg = ('Singapore has unrecognized generation type "{}" '
+                   'with production share {}%').format(gen_type, gen_percent)
+            logger.warning(msg)
             generation_by_type['unknown'] += gen_mw

-    generation_by_type['solar'] = get_solar(session=None)
+    generation_by_type['solar'] = get_solar(requests_obj, logger)

     # some generation methods that are not used in Singapore
     generation_by_type.update({
@@ -216,12 +219,16 @@ def fetch_production(zone_key='SG', session=None, target_datetime=None, logger=N
         'hydro': 0
     })

+    source = 'emcsg.com'
+    if generation_by_type['solar']:
Collaborator

Nice idea.

Collaborator Author

@jarek, Mar 31, 2018

I'm wondering if it's too clever. When there is no solar production, the data that solar is 0 also comes from EMA...

Collaborator

Maybe both should be included by default?

+        source += ', ema.gov.sg'
+
     return {
         'datetime': sg_data_to_datetime(data),
         'zoneKey': zone_key,
         'production': generation_by_type,
         'storage': {},  # there is no known electricity storage in Singapore
-        'source': 'emcsg.com'
+        'source': source
     }
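
The concern raised in the review thread comes from truthiness: a reading of `0.0` is falsy, so `if generation_by_type['solar']:` skips the EMA credit even though the zero came from EMA. One possible way to handle it, sketched here with a hypothetical helper `build_source` that is not part of the PR, is to test for `None` explicitly.

```python
def build_source(solar):
    """Hypothetical helper: credit ema.gov.sg whenever a solar reading exists."""
    source = 'emcsg.com'
    # 'is not None' keeps the credit for a legitimate reading of 0.0 MW,
    # and drops it only when OCR failed and no solar value is available.
    if solar is not None:
        source += ', ema.gov.sg'
    return source
```

With this check, a night-time reading of `0.0` still lists both sources, while a failed OCR run (`None`) lists only emcsg.com.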

