<h1 align='center'>Data workflow</h1>

<h3 align='center'>Laura G. Funderburk</h3>

<h3 align='center'>Data Scientist, Cybera</h3>

<h2 align='center'>3 Data Workflow Practices I follow</h2>


1. Leverage both Bash and Python (or R) to access and process data

2. Don't be afraid of messy exploratory work

3. ...but be sure to tidy up as you go!

<h2 align='center'>Leverage both Bash and Python (or R) to access and process data</h2>


To motivate this exercise, I will showcase how we can leverage these tools to write a set of scripts that:

1. Download data from two different APIs containing data on CO2 emissions.

2. Time downloads. 

3. Generate a CSV file with the query output and the time each download took.

4. Generate visualizations of the time it took two different APIs to download the data.

5. Push results to a GitHub repository.
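As a minimal sketch of step 2, timing a download can be separated from the download itself. The helper name `time_call` and the stand-in `fake_download` are hypothetical and only illustrate the idea; in practice the callable could wrap a real request such as `requests.get(url)`.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fetch: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fetch() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fetch()
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_download() -> dict:
    """Hypothetical stand-in for an API call so the sketch runs offline."""
    time.sleep(0.01)
    return {"data": []}


payload, seconds = time_call(fake_download)
print(f"Download took {seconds:.4f} seconds")
```

Keeping the timer generic means the same helper can time either of the two APIs without any changes.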

<h2 align='center'>Don't be afraid of messy exploratory work</h2>

The scripts with final work look nice...so to motivate how I got there, here is some of what my initial exploration looked like. 

In [32]:
# Code to curl data using Python
import requests
import pandas as pd
import time

url = "https://api.carbonintensity.org.uk/intensity"
response = requests.get(url)
response.raise_for_status()
# Access the JSON content
jsonResponse = response.json()

In [75]:
# Code to append time it took to download

# Time query
start_time = time.time()
# Send the GET request
response = requests.get(url)
total_time = time.time() - start_time
# Raise an HTTPError if the response status is 4xx or 5xx
response.raise_for_status()
# Access the JSON content
jsonResponse = response.json()

In [76]:
# Code to turn the JSON blob into a dataframe and build a timestamp for the CSV
print(jsonResponse)

print("TO DATAFRAME")

display(pd.json_normalize(jsonResponse, record_path='data'))

# add time stamp

dates = pd.to_datetime('today')
dates_str = str(dates)
date_f = dates_str.split(" ")[0]
time_f = dates_str.split(" ")[1].split(".")[0]

{'data': [{'from': '2021-07-01T04:30Z', 'to': '2021-07-01T05:00Z', 'intensity': {'forecast': 188, 'actual': 174, 'index': 'moderate'}}]}
TO DATAFRAME


| |from|to|intensity.forecast|intensity.actual|intensity.index|
|-|-|-|-|-|-|
|0|2021-07-01T04:30Z|2021-07-01T05:00Z|188|174|moderate|


In [77]:
# Code to add query metadata columns, used later for the visualizations

df = pd.json_normalize(jsonResponse, record_path='data')

df['query.name'] = "Intensity"
df['query.lasted'] = total_time

df['query.date'] = date_f
df['query.time'] = time_f

In [74]:
df

| |from|to|intensity.forecast|intensity.actual|intensity.index|query.name|query.lasted|query.date|query.time|
|-|-|-|-|-|-|-|-|-|-|
|0|2021-07-01T04:30Z|2021-07-01T05:00Z|188|174|moderate|Intensity|0.74233|2021-06-30|22:29:09|
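The exploration above stops before anything is written to disk. One possible way to log each query as a CSV row is sketched below. It uses the standard library `csv` module rather than pandas so it runs without extra dependencies. The function name `append_row` and the file name `query_log.csv` are assumptions for illustration; the row values are copied from the output above.

```python
import csv
import os
import tempfile


def append_row(csv_path: str, row: dict) -> None:
    """Append one dict as a CSV row, writing the header only for a new file."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row.keys()))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Values copied from the dataframe shown above
row = {
    "from": "2021-07-01T04:30Z",
    "to": "2021-07-01T05:00Z",
    "intensity.forecast": 188,
    "intensity.actual": 174,
    "intensity.index": "moderate",
    "query.name": "Intensity",
    "query.lasted": 0.74233,
    "query.date": "2021-06-30",
    "query.time": "22:29:09",
}

# Write to a temporary directory so the sketch leaves no files behind
csv_path = os.path.join(tempfile.mkdtemp(), "query_log.csv")
append_row(csv_path, row)
append_row(csv_path, row)  # a second query adds a row, not a second header
```

Appending instead of overwriting lets the file accumulate one row per query, which is the shape needed to compare download times later.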


<h2 align='center'>...but be sure to tidy up as you go!</h2>

1. Give functions and variables descriptive names

2. Document what your code does, and note the data types that each function you write takes and returns

3. Test that your code does what you think it does

4. **R**efactor code, **R**educe repetition, **R**emove unused code (no need to keep a large piece of commented code "in case I need it some day"). 
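To make points 1 and 2 concrete, here is a small sketch of a documented, typed helper. The function `flatten_record` is hypothetical and is not taken from the scripts in this talk; the sample record is the one returned by the API during the exploration above.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], separator: str = ".") -> Dict[str, Any]:
    """Flatten one level of nesting in a record.

    Parameters
    ----------
    record : dict
        A record whose values may themselves be dictionaries.
    separator : str
        String placed between the outer and inner key names.

    Returns
    -------
    dict
        A new dictionary with nested keys joined as "outer.inner".
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for inner_key, inner_value in value.items():
                flat[f"{key}{separator}{inner_key}"] = inner_value
        else:
            flat[key] = value
    return flat


sample = {
    "from": "2021-07-01T04:30Z",
    "to": "2021-07-01T05:00Z",
    "intensity": {"forecast": 188, "actual": 174, "index": "moderate"},
}
flat_sample = flatten_record(sample)
print(flat_sample)
```

The descriptive name, the type hints, and the docstring together tell a reviewer what goes in and what comes out without reading the body.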

In [None]:
# Demonstrate using Jupyter and editing in VS Code to showcase refactoring and cleaning code


In [80]:
# Demonstrate running the script several times to ensure it works
%run -i ./scripts/uk_co2_download_performance_monitor.py

0.8349928855895996
Could not complete query


In [None]:
# Demonstrate a sample of unit testing 
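
A sample of what such a unit test might look like is sketched below. The helper `split_timestamp` is a hypothetical refactoring of the date and time string-splitting from the exploration, not code from the actual script. The tests are plain functions with `assert` statements, which a runner such as pytest would discover by their `test_` prefix; they can also be called directly in a notebook cell.

```python
from datetime import datetime
from typing import Tuple


def split_timestamp(moment: datetime) -> Tuple[str, str]:
    """Return (YYYY-MM-DD, HH:MM:SS) strings for a datetime."""
    return moment.strftime("%Y-%m-%d"), moment.strftime("%H:%M:%S")


def test_split_timestamp_drops_microseconds():
    moment = datetime(2021, 6, 30, 22, 29, 9, 123456)
    assert split_timestamp(moment) == ("2021-06-30", "22:29:09")


def test_split_timestamp_pads_single_digits():
    moment = datetime(2021, 1, 2, 3, 4, 5)
    assert split_timestamp(moment) == ("2021-01-02", "03:04:05")
```

Passing a fixed `datetime` into the function, instead of reading the current time inside it, is what makes the behaviour testable.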

In [None]:
# Demonstrate bringing our Python script into Bash and automate

<h2 align='center'>Data Collaboration Practices I follow</h2>


1. When you write code, assume another person at some point in the future will review it. Write your code as if you are writing an article for someone else to read. 

2. GitHub etiquette to contribute code: fork a repository, create a new branch, make changes on that branch, create a pull request. Create clear notes on what the contribution does. 

    **Pro-tip: read a repository's issues and comment on them proposing your changes before you invest time creating something that might be a duplicate effort, or which is not compatible.**
    
3. GitHub etiquette to request code changes: create **clear**, **concise**, **well documented**, and **specific** issues and documentation on how members can interact with and add content to the repository. 

4. Do your best to provide constructive feedback, and assume that when someone provides feedback to you, they want to help you improve the quality of the code, documentation, or feature.

<h2 align='center'>Hands on exercise contributing to a repository</h2>

1. Visit https://github.com/cybera/DS-industry-fellowship-2021

2. View issues https://github.com/cybera/DS-industry-fellowship-2021/issues

3. Hands-on time: break into two teams, with each team working together on one of the two issues




In these issues there are tasks for improving code quality in a Python script called `dummy.py`.

Let's first get familiar with what the code does.

In [30]:
%run -i ./scripts/dummy.py

3 raised to the power of 2 is 9. This method took 1.9073486328125e-06 seconds
3 raised to the power of 2 is 9. This method took 4.0531158447265625e-06 seconds


|`i` |`j` | `iter_sum`|
|-|-|-|
|0 |0 |0|
|0 |1 |1|
|0 |2 |3|
|1 |0 |4|
|1 |1 |6|
|1 |2 |9|

At the end of the loop, `iter_sum` is 9.
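
The table is consistent with a nested loop that adds `i + j` to a running total. The sketch below is a reconstruction under that assumption. The function name `nested_sum` and its parameters are hypothetical, and the actual code in `dummy.py` may be written differently.

```python
def nested_sum(outer: int = 2, inner: int = 3) -> int:
    """Accumulate i + j over a nested loop, printing each step."""
    iter_sum = 0
    for i in range(outer):
        for j in range(inner):
            iter_sum += i + j
            print(i, j, iter_sum)  # one line per row of the table
    return iter_sum


total = nested_sum()
```

Tracing a loop by hand in a table like this is a quick way to confirm that the code does what you think it does before refactoring it.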

<h2 align='center'>Summary of what we learned</h2>
