<br> <img src="images/uptake-logo-white.png" align="left" height="100" width="100">

### Data Science Case Study 
by Harsh Pandey

------

### How To Run the File

* Unzip the jar file: case-study-server.jar
* From terminal / CLI, launch the case study server on port 8081:
    `java -jar case-study-server 8081`
* Make sure your machine has Python 3.5.x installed.
* `pip install pipreqs`
* `pip install -r requirements.txt`

Finally, run
`python uptake.py`


We begin by connecting to the appropriate API endpoint. For this project, the API endpoint is http://localhost:8081/.

The API also exposes three query gateways:

* /sites
* /turbines
* /signals
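
As a rough sketch of how these gateways map to full request URLs, the snippet below joins each path onto the base URL and encodes the query parameters with the standard library. The parameter names (`siteId`, `turbineId`, `startEpochMs`, `endEpochMs`) are the ones used by the request helpers later in this write-up; `build_url` itself is a hypothetical helper and not part of the case study server.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "http://localhost:8081/"

def build_url(endpoint, **params):
    """Join an endpoint onto the base URL and append encoded query parameters."""
    full = urljoin(BASE_URL, endpoint)
    return full + "?" + urlencode(params) if params else full

sites_url = build_url("sites")
turbines_url = build_url("turbines", siteId="site_1")
signals_url = build_url("signals", turbineId="turbine_1",
                        startEpochMs=1534898938360, endEpochMs=1534898938360)
print(sites_url)
print(turbines_url)
print(signals_url)
```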


First, we connect to the API with proper authentication. I am using Python's `requests` library to do this.

In [203]:
import requests
import pandas as pd
import json
import importlib.util
import sys
import os


# api connection credentials
url = 'http://localhost:8081/'
apikey = 'casestudy'
head = {'apikey': apikey}


# sample get request
sample_req = requests.get(url+'sites', params={}, headers=head)
sample_req

<Response [200]>

Perfect! A 200 response. The connection to the API is now established.

### Now let's make our life easy by writing request functions for each of the gateways: 

In [204]:
# helper functions
def request_sites(url, head):
	endpointurl = url+'sites'
	params = {}

	# get with head only, parameter: None
	r = requests.get(endpointurl, params=params, headers=head)
	return r.text


def request_turbines(url, siteid, head):
	endpointurl = url+'turbines'
	params = {'siteId':siteid}

	# get with params and headers
	r = requests.get(endpointurl, params=params, headers=head)
	return r.text


def request_signals(url,turbineId, startEpochMs, endEpochMs, head):
	endpointurl = url+'signals'
	params = {'turbineId': turbineId, 'startEpochMs': startEpochMs, 'endEpochMs': endEpochMs}
	r = requests.get(endpointurl, params=params, headers=head)
	return r.text
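
The three helpers share the same shape, so one possible refactor is a single function that also sets a timeout and raises on non-2xx responses. The sketch below is hypothetical: `fetch_json` is not in the original script, and it expects a `requests.Session`-like object (anything with a compatible `get` method), so no network call happens until a real session is passed in.

```python
def fetch_json(session, base_url, endpoint, headers, params=None, timeout=10):
    """GET base_url + endpoint and return the decoded JSON body.

    `session` is assumed to behave like requests.Session: its get() accepts
    params/headers/timeout and returns an object with raise_for_status()
    and json().
    """
    response = session.get(base_url + endpoint, params=params or {},
                           headers=headers, timeout=timeout)
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
    return response.json()
```

With the real library this might be called as `fetch_json(requests.Session(), url, 'turbines', head, {'siteId': 'site_1'})`. Note that it returns parsed JSON rather than `r.text`, so a later `json.loads` step would no longer be needed.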

### Sample response coming from each request


In [205]:
sites = request_sites(url, head)
sites

'[{"operatorName":"operator_1","siteId":"site_1","coordinates":[-90.344927,40.712915]},{"operatorName":"operator_2","siteId":"site_2","coordinates":[-120.566084,38.821521]},{"operatorName":"operator_3","siteId":"site_3","coordinates":[-101.590009,23.562728]}]'

In [206]:
turbines = request_turbines(url, "site_1", head)
turbines

'[{"turbineId":"turbine_1","coordinates":[-90.344923,40.712953],"manufacturer":"manufacturer_1","model":"OJLQ"},{"turbineId":"turbine_2","coordinates":[-90.344946,40.712944],"manufacturer":"manufacturer_1","model":"FEGU"},{"turbineId":"turbine_3","coordinates":[-90.344928,40.712931],"manufacturer":"manufacturer_1","model":"ENHY"},{"turbineId":"turbine_4","coordinates":[-90.344974,40.712964],"manufacturer":"manufacturer_1","model":"KIYY"},{"turbineId":"turbine_5","coordinates":[-90.344996,40.712974],"manufacturer":"manufacturer_1","model":"UJUY"},{"turbineId":"turbine_6","coordinates":[-90.344957,40.712996],"manufacturer":"manufacturer_1","model":"VCTF"}]'

In [207]:
signals = request_signals(url, "turbine_1", 1534898938360, 1534898938360, head)
signals

'[{"rpm":7.5,"temperature":32.6,"power":545.0,"windSpeed":30.0,"orientation":31.0,"vibration":8299.0,"cellTemp":30.3,"bearingTemp":30.8,"gear":"gear_3","timestamp":"2018-08-21T19:48:58.360-0500"}]'
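
The `timestamp` field in the response and the `startEpochMs`/`endEpochMs` parameters describe the same instant in two formats. A minimal sketch of the conversion, using only the standard library; the helper names are illustrative, and the format string assumes every timestamp follows the pattern shown in the sample response.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"  # e.g. 2018-08-21T19:48:58.360-0500

def timestamp_to_epoch_ms(ts):
    """Parse an API timestamp string and return integer epoch milliseconds."""
    dt = datetime.strptime(ts, TS_FORMAT)
    # Integer division of timedeltas avoids float rounding errors
    return (dt - EPOCH) // timedelta(milliseconds=1)

def epoch_ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

ms = timestamp_to_epoch_ms("2018-08-21T19:48:58.360-0500")
print(ms)                   # 1534898938360, the value used in the request above
print(epoch_ms_to_utc(ms))  # 2018-08-22 00:48:58.360000+00:00
```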

## Transformation

Looks like the above response is a document with key-value pairs; the pattern appears to be JSON.

I am using Python's `pandas` and `json` libraries to start cleaning and transforming the incoming data from the Customer API.

In [208]:
def clean_response(requestBody):
    # Parses the incoming JSON response text into Python objects,
    # then loads them into a pandas dataframe and returns it

    records = json.loads(requestBody)  # parse JSON text into a list of dicts
    df = pd.DataFrame(records)  # one row per record
    return df


# clean incoming responses
sites = clean_response(sites)
turbines = clean_response(turbines)
signals = clean_response(signals)

In [209]:
sites.head()

| | coordinates | operatorName | siteId |
|---|---|---|---|
| 0 | [-90.344927, 40.712915] | operator_1 | site_1 |
| 1 | [-120.566084, 38.821521] | operator_2 | site_2 |
| 2 | [-101.590009, 23.562728] | operator_3 | site_3 |
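
The `coordinates` column holds a two-element list per row, which is awkward to filter or join on. One way to flatten it is sketched below. The column names `longitude` and `latitude` are illustrative, and the [longitude, latitude] order is an inference from the values (-120.566084 cannot be a latitude), not something the API documents.

```python
import pandas as pd

sites_sample = pd.DataFrame({
    "coordinates": [[-90.344927, 40.712915], [-120.566084, 38.821521]],
    "operatorName": ["operator_1", "operator_2"],
    "siteId": ["site_1", "site_2"],
})

# Expand each [lon, lat] list into two numeric columns (order is assumed)
lon_lat = pd.DataFrame(sites_sample["coordinates"].tolist(),
                       columns=["longitude", "latitude"],
                       index=sites_sample.index)
sites_flat = pd.concat([sites_sample.drop(columns="coordinates"), lon_lat], axis=1)
print(sites_flat)
```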


In [210]:
turbines.head()

| | coordinates | manufacturer | model | turbineId |
|---|---|---|---|---|
| 0 | [-90.344923, 40.712953] | manufacturer_1 | OJLQ | turbine_1 |
| 1 | [-90.344946, 40.712944] | manufacturer_1 | FEGU | turbine_2 |
| 2 | [-90.344928, 40.712931] | manufacturer_1 | ENHY | turbine_3 |
| 3 | [-90.344974, 40.712964] | manufacturer_1 | KIYY | turbine_4 |
| 4 | [-90.344996, 40.712974] | manufacturer_1 | UJUY | turbine_5 |


In [211]:
signals.head()

| | bearingTemp | cellTemp | gear | orientation | power | rpm | temperature | timestamp | vibration | windSpeed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.8 | 30.3 | gear_3 | 31.0 | 545.0 | 7.5 | 32.6 | 2018-08-21T19:48:58.360-0500 | 8299.0 | 30.0 |


### Fetch entire dataset from the API

We only need to fetch the full data for two datasets: turbines and signals. The sites data is already complete.

In [212]:
# get all unique site ids 
total_sites = sites['siteId'].unique().tolist()
total_sites

['site_1', 'site_2', 'site_3']

In [222]:
# there are three site ids
# we will now extract all turbine data for each site id
turbine_frames = []
for site in total_sites:
    turbines = request_turbines(url, site, head)
    turbines = clean_response(turbines)
    turbine_frames.append(turbines)

# concatenate once; DataFrame.append was removed in pandas 2.0
all_turbine_data = pd.concat(turbine_frames, ignore_index=True)

# Save data to temp destination
def create_destination(filename):
    destination = os.path.join(os.getcwd(), "raw_data")

    # check if the folder exists already
    if os.path.exists(destination):
        print("Warning: Folder already exists at the location", destination)
    else:
        os.mkdir(destination)

    filename = os.path.join(destination, filename + ".csv")
    return filename
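
One thing to watch in the loop above: the sample `/turbines` response shown earlier has no `siteId` field, so once the per-site results are concatenated there is no record of which site each turbine came from. A small sketch of tagging each record before combining, written against plain dicts so it runs without the server. `fetch_turbines` is a hypothetical stand-in for `request_turbines(url, site, head)`, and the canned values it returns are made up for illustration.

```python
import json

def fetch_turbines(site_id):
    """Hypothetical stand-in for request_turbines(url, site_id, head)."""
    canned = {  # made-up sample bodies, not real API output
        "site_1": '[{"turbineId": "turbine_a", "model": "XXXX"}]',
        "site_2": '[{"turbineId": "turbine_b", "model": "YYYY"}]',
    }
    return canned[site_id]

def collect_turbines(site_ids, fetch):
    """Fetch turbines per site and tag each record with its siteId."""
    records = []
    for site_id in site_ids:
        for turbine in json.loads(fetch(site_id)):
            turbine["siteId"] = site_id  # keep the site so later joins are possible
            records.append(turbine)
    return records

all_turbines = collect_turbines(["site_1", "site_2"], fetch_turbines)
for row in all_turbines:
    print(row)
```

Passing the result to `pd.DataFrame(all_turbines)` would then give a frame like `all_turbine_data` with an extra `siteId` column.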

In [197]:
signals.head()

| | bearingTemp | cellTemp | gear | orientation | power | rpm | temperature | timestamp | vibration | windSpeed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.8 | 30.3 | gear_3 | 31.0 | 545.0 | 7.5 | 32.6 | 2018-08-21T19:48:58.360-0500 | 8299.0 | 30.0 |


In [198]:
turbineIds = all_turbine_data['turbineId'].unique()
turbineIds

array(['turbine_1', 'turbine_2', 'turbine_3', 'turbine_4', 'turbine_5',
       'turbine_6', 'turbine_7', 'turbine_8', 'turbine_9', 'turbine_14',
       'turbine_15', 'turbine_16', 'turbine_17', 'turbine_18',
       'turbine_19', 'turbine_21', 'turbine_22'], dtype=object)

Based on the case study, the window between Aug 20 2018 and Aug 26 2018 spans 6 days, or 518,400 seconds. Querying one epoch timestamp (in milliseconds) for each second of that window would take:

* 518,400 times the number of unique turbineIds, or
* 518,400 * 17 = 8,812,800 request calls.

I debated whether making that many calls to the API would be a good idea in a real-world situation.
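
To make the arithmetic concrete, the sketch below computes the per-second call count and compares it with splitting the same window into hourly ranges. It assumes UTC day boundaries and, more importantly, that `/signals` returns every reading between `startEpochMs` and `endEpochMs`; the sample request above only shows start equal to end, so that range behaviour is unverified. The helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to integer epoch milliseconds."""
    return (dt - EPOCH) // timedelta(milliseconds=1)

def time_windows(start, end, step):
    """Yield (startEpochMs, endEpochMs) pairs covering [start, end) in step-sized chunks."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield to_epoch_ms(cursor), to_epoch_ms(upper)
        cursor = upper

start = datetime(2018, 8, 20, tzinfo=timezone.utc)  # assumed UTC boundaries
end = datetime(2018, 8, 26, tzinfo=timezone.utc)
n_turbines = 17

seconds_in_window = int((end - start).total_seconds())
per_second_calls = seconds_in_window * n_turbines

hourly_windows = list(time_windows(start, end, timedelta(hours=1)))
hourly_calls = len(hourly_windows) * n_turbines

print(seconds_in_window, per_second_calls)  # 518400 8812800
print(len(hourly_windows), hourly_calls)    # 144 2448
```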

I am going to refer to the `signal.csv` file to get a list of possible timestamps in epoch milliseconds.

In [182]:
# read signal_data from file
signal_data = pd.read_csv('case-study-server/signal.csv')

# convert columns in signal_data to a list
cols = signal_data.columns.tolist()

# check number of unique values per column in signal_data
for i in cols:
    print("Column is {} and total unique values are {}".format(i, signal_data[i].nunique()))

Column is rpm and total unique values are 62
Column is temperature and total unique values are 54
Column is power and total unique values are 100
Column is windSpeed and total unique values are 18
Column is orientation and total unique values are 23
Column is vibration and total unique values are 177
Column is cellTemp and total unique values are 55
Column is bearingTemp and total unique values are 38
Column is bladeAngle and total unique values are 27
Column is gear and total unique values are 5
Column is timestamp and total unique values are 185
Column is turbineId and total unique values are 17
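
For reference, pandas can produce the same per-column counts without an explicit loop via `DataFrame.nunique()`. A minimal sketch on a tiny made-up frame (the values below are illustrative and not taken from `signal.csv`):

```python
import pandas as pd

sample = pd.DataFrame({
    "gear": ["gear_1", "gear_3", "gear_3"],
    "turbineId": ["turbine_1", "turbine_1", "turbine_2"],
    "rpm": [7.5, 7.5, 7.5],
})

# One unique-value count per column, returned as a Series indexed by column name
unique_counts = sample.nunique()
print(unique_counts)
```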


In [223]:
# save api data locally
sites_file = create_destination('sites')
sites.to_csv(sites_file, index=False)

turbine_file = create_destination('all_turbines_data')
all_turbine_data.to_csv(turbine_file, index=False)


signal_file = create_destination('signal')
signal_data.to_csv(signal_file, index=False)



## Analysis

1. Load data
2. Join data to analyze

In [None]:
Following steps in the application:

* Connect to the database
* Ingest files from raw_data/ into the database
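
The source does not say which database is used, so the following is only a sketch under the assumption of a local SQLite database. It loads every CSV in a folder into a table named after the file, then joins signals to turbines on `turbineId`. The function name `ingest_csv_folder`, the all-text column types, and the tiny sample files are illustrative.

```python
import csv
import os
import sqlite3
import tempfile

def ingest_csv_folder(conn, folder):
    """Load each .csv in `folder` into a table named after the file (all TEXT columns).

    Table and column names are interpolated into SQL, so use trusted local files only.
    """
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".csv"):
            continue
        table = os.path.splitext(name)[0]
        with open(os.path.join(folder, name), newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader)
            cols = ", ".join('"{}" TEXT'.format(c) for c in header)
            marks = ", ".join("?" for _ in header)
            conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, cols))
            conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), reader)
    conn.commit()

# Demo with made-up rows in a temporary folder standing in for raw_data/
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "all_turbines_data.csv"), "w", newline="") as fh:
    csv.writer(fh).writerows([["turbineId", "model"], ["turbine_1", "OJLQ"]])
with open(os.path.join(folder, "signal.csv"), "w", newline="") as fh:
    csv.writer(fh).writerows([["turbineId", "rpm"], ["turbine_1", "7.5"]])

conn = sqlite3.connect(":memory:")
ingest_csv_folder(conn, folder)
joined = conn.execute(
    "SELECT s.turbineId, t.model, s.rpm FROM signal AS s "
    "JOIN all_turbines_data AS t ON s.turbineId = t.turbineId"
).fetchall()
print(joined)
```

Storing everything as text keeps the loader generic; numeric columns would need a cast (or typed table definitions) before any aggregation.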
