## Question 0

This script retrieves a sample from the Bicing endpoint of the citybik.es web API.
The whole analysis is based on a sample made of snapshots of all stations in Gracia
and the nearby ones, taken at 5-minute intervals. For the sake of simplicity, the sample retrieval covers a 24-hour period at 5-minute intervals. The assignment itself asks for data retrieved from 7:00 AM to 10:00 AM every day over the last one or two years.
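
The 7:00 AM to 10:00 AM window that the assignment asks for could be enforced with a small guard at the top of the script. The following is only a sketch under assumptions: the name `in_sampling_window` and the choice of a half-open interval (7:00 included, 10:00 excluded) are illustrative and are not part of the original script.

```python
import datetime

# Window bounds taken from the assignment text, interpreted as local time (assumption).
WINDOW_START = datetime.time(7, 0)
WINDOW_END = datetime.time(10, 0)


def in_sampling_window(moment, start=WINDOW_START, end=WINDOW_END):
    """Return True when the datetime `moment` falls inside [start, end)."""
    return start <= moment.time() < end
```

A run scheduled outside the window could then call `in_sampling_window(datetime.datetime.now())` and exit early instead of storing a snapshot.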

The script gets a snapshot of the Gracia district and its surroundings
from "http://api.citybik.es/v2/networks/bicing".
It stores the snapshot in MongoDB and tags it with a timestamp.

'bike_leeching.py' is part of the data retrieval process used to build the statistics.
The data corpus is a recurring 5-minute snapshot from the API.
Instead of coding a retrieve-and-sleep loop, the process is scheduled
with the Linux cron command. A looping/sleeping retrieval daemon would need to handle KILL interruptions
to avoid being killed while writing data into the DB. I chose a simple
solution: not to code the handling of KILLs and to let cron manage the retrieve-sleep loop.
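
To illustrate what the discarded loop-and-sleep alternative would involve, here is a minimal sketch; it is not the author's code. It assumes the interruption arrives as SIGTERM or SIGINT (SIGKILL cannot be caught by any process) and postpones the exit until the current snapshot has been written. `take_snapshot` is a hypothetical stand-in for the retrieve-and-store step, and `GracefulLoop` and `install_handlers` are illustrative names.

```python
import signal
import time


class GracefulLoop:
    """Retrieve-sleep loop that only stops between snapshots."""

    def __init__(self, take_snapshot, interval_seconds=300):
        self.take_snapshot = take_snapshot  # hypothetical callable
        self.interval_seconds = interval_seconds
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag, never exit in the middle of a write.
        self.stop_requested = True

    def run(self, max_iterations=None):
        done = 0
        while not self.stop_requested:
            self.take_snapshot()  # a stop request is honoured only after this returns
            done += 1
            if max_iterations is not None and done >= max_iterations:
                break
            # Sleep in 1-second slices so a stop request is noticed quickly.
            waited = 0
            while waited < self.interval_seconds and not self.stop_requested:
                time.sleep(1)
                waited += 1
        return done


def install_handlers(loop):
    """Route SIGTERM and SIGINT to the loop's stop flag (call from the main thread)."""
    signal.signal(signal.SIGTERM, loop.request_stop)
    signal.signal(signal.SIGINT, loop.request_stop)
```

With cron, none of this bookkeeping is needed: each run is a short-lived process that takes one snapshot and exits.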


**DO NOT RUN** this Jupyter notebook on this platform. There is no MongoDB daemon available to
interact with.
Further scripts use a dump of the MongoDB snapshot database. That dump is stored on this
platform and is accessible from the scripts.


In [12]:
#!/usr/bin/python3
# -*- coding: utf-8 -*-

'''
Gets a snapshot of the Gracia district and its surroundings
from "http://api.citybik.es/v2/networks/bicing".
Stores the snapshot in MongoDB and tags it with a timestamp.


'bike_leeching.py' is part of the data retrieval process used to build the
statistics.
The data corpus is a recurring 5-minute snapshot from the API.
Instead of coding a retrieve-and-sleep loop, the process is scheduled
with the Linux cron command.
A looping/sleeping retrieval daemon would need to handle KILL interruptions to avoid
being killed while writing data into the DB. I chose a simple solution: not to code
the handling of KILLs and to let cron manage the retrieve-sleep loop.


DO NOT RUN this Jupyter notebook on this platform. There is no MongoDB daemon available to
interact with.
Further scripts use a dump of the MongoDB snapshot database. That dump is stored on this
platform and is accessible from the scripts.
'''

import datetime

import requests
from pymongo import MongoClient


__author__ = "Alexis Torrano"
__email__ = "a.torrano.m@gmail.com"
__status__ = "Production"


In [None]:
# Get the JSON data and generate a timestamp for the current snapshot
url = "http://api.citybik.es/v2/networks/bicing"
response = requests.get(url, timeout=15)
now = datetime.datetime.now()
timeStampStr = str(now)
timeStampSeconds = now.timestamp()

# Abort on any HTTP status code other than 200
if response.status_code != 200:
    raise SystemExit("ERROR " + str(response.status_code))

jsresp = response.json()
originPBOX = '08012'  # Gracia district postal code
stations = jsresp['network']['stations']
# Expand the fields of the 'extra' dictionary into top-level keys so they can be
# accessed in one step later, when building DataFrame objects.
for x in stations:
    x.update(x['extra'])
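
As a small illustration of the flattening step, the following sketch applies the same idea to a single hand-written station record (values copied from the station example shown later in this notebook). The helper name `flatten_extra` is hypothetical. Note that a key in 'extra' overwrites a top-level key with the same name.

```python
def flatten_extra(station):
    """Copy every key of station['extra'] to the top level, in place."""
    station.update(station.get('extra', {}))
    return station


# Hand-written record; values taken from the station example in this notebook.
station = {
    'name': '222 - C/ DEL CANÓ, 1',
    'free_bikes': 0,
    'empty_slots': 26,
    'extra': {'uid': 222, 'zip': '08012', 'status': 'CLS',
              'NearbyStationList': [107, 221, 226, 229]},
}
flatten_extra(station)
print(station['uid'], station['zip'])
```

After the call, `station['uid']` and `station['zip']` are reachable directly, while the original 'extra' dictionary is left in place.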


In [None]:

## ASSUMPTION: the 'name' attribute is an ALTERNATE KEY for any station, so there are no repeats in the loop
# Get all Gracia stations and their neighbouring stations
GraciaStations = [x for x in stations if originPBOX == x['extra']['zip']]
neighbourIdxSet = set(n for g in GraciaStations for n in g['extra']['NearbyStationList'])
NeighbourStations = [x for x in stations if x['uid'] in neighbourIdxSet]

'''
Short on time, with many decisions to make and things to solve, I store the full JSON for
every station. After 24 h of leeching I will purge what I do not need and prepare a schema
to store the retrieved statistics. I cannot risk having to repeat a 24 h sample.
'''

'''
# A station example
GraciaStations[1]
{'NearbyStationList': [107, 221, 226, 229],
 'address': 'Carrer del Canó',
 'districtCode': '5',
 'empty_slots': 26,
 'extra': {'NearbyStationList': [107, 221, 226, 229],
  'address': 'Carrer del Canó',
  'districtCode': '5',
  'status': 'CLS',
  'uid': 222,
  'zip': '08012'},
 'free_bikes': 0,
 'id': '61c9fa7147cd773aaa874a4174879cf7',
 'latitude': 41.40124,
 'longitude': 2.157483,
 'name': '222 - C/ DEL CANÓ, 1',
 'status': 'CLS',
 'timestamp': '2018-11-22T11:05:17.424000Z',
 'uid': 222,
 'zip': '08012'}
'''
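
The selection of district stations and their neighbours can be tried out on a few made-up records. This is a sketch under assumptions: `select_area` is a hypothetical helper that mirrors the list comprehensions above, and every record except uid 222 is invented for illustration.

```python
def select_area(stations, zip_code):
    """Return (stations in zip_code, stations listed as their neighbours)."""
    inside = [s for s in stations if s['extra'].get('zip') == zip_code]
    neighbour_uids = {uid
                      for s in inside
                      for uid in s['extra'].get('NearbyStationList', [])}
    neighbours = [s for s in stations if s['extra'].get('uid') in neighbour_uids]
    return inside, neighbours


# Made-up records; only uid 222 mirrors the real example above.
toy = [
    {'extra': {'uid': 222, 'zip': '08012', 'NearbyStationList': [107, 221]}},
    {'extra': {'uid': 221, 'zip': '08012', 'NearbyStationList': [222]}},
    {'extra': {'uid': 107, 'zip': '08025', 'NearbyStationList': [222]}},
    {'extra': {'uid': 1, 'zip': '08003', 'NearbyStationList': []}},
]
inside, neighbours = select_area(toy, '08012')
inside_uids = [s['extra']['uid'] for s in inside]
neighbour_uids = [s['extra']['uid'] for s in neighbours]
print(inside_uids, neighbour_uids)
```

In this toy data, stations 222 and 221 are inside the district and also list each other as neighbours, so they appear in both results. The comprehensions in the notebook behave the same way, which matters if the two lists are later merged.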

In [None]:
# Store the Gracia and neighbour lists in MongoDB with a timestamp.
# Each snapshot has the structure <timestamp, Gracia list, full neighbours list>.
mongoC = MongoClient('mongodb://localhost:27017/')
dbHosco = mongoC['HOSCO']
createIndex = "timeBikeAllocation" not in dbHosco.list_collection_names()
timeBikeAllocation = dbHosco['timeBikeAllocation']
if createIndex:
    # The collection "timeBikeAllocation" did not exist yet, so its indexes must be created.
    timeBikeAllocation.create_index("timeStampStr", unique=True)
    timeBikeAllocation.create_index("timeStampSeconds", unique=True)

try:
    timeBikeAllocation.insert_one({
        "_id": timeStampSeconds,
        "timeStampStr": timeStampStr,
        "timeStampSeconds": timeStampSeconds,
        "gracia": GraciaStations,
        "neighbours": NeighbourStations,
    })
except Exception as e:
    print(str(e))

