# ISTA 322 - A bit of python programming
 
This lesson is just a brief review of some python programming skills. Specifically, this is about how to make functions to either check some specific aspect of the data or extract something from it. As you've learned so far, the major goal of data engineering is transforming data into a more usable format. Ideally you want to wrap up your transform operations into functions so they can be repeatedly applied to new data or used in automated workflows.
 
We're just covering the basics here. This should be review, as you all needed at least one python programming class to be here. If you're rusty or have only had that one class, be sure to take some time here. If you've taken a bunch of python already you can probably skip this.

## Import libraries and data

Let's work a bit with bikeshare data.  Many of you have seen this from my other classes, but it's a good dataset for developing various programming and data skills.  

One issue in this dataset is that there are outliers. Riders might have put in improper birthdays, or bikes might have had strange ride times or distances. You could visualize these features and look for outliers. You could also make functions that go through and count how many values are above or below a certain threshold. We're going to do the latter to help us practice making functions.

## Explore

In [6]:
# Libraries
import pandas as pd
import numpy as np

In [7]:
# Bring in data as bikes
bikes = pd.read_csv("https://docs.google.com/spreadsheets/d/16JEvhh52QMG51i_5cnppjE-JOeJRbjFwTkp5uBHQE0A/gviz/tq?tqx=out:csv")


In [9]:
# Quick look at the head of the data
bikes.head()

Unnamed: 0,start_time,end_time,bikeid,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear,tourist,distance_miles
0,2018-01-01 2:33:36,2018-01-01 2:46:18,6068,193,State St & 29th St,120,Wentworth Ave & Cermak Rd (Temp),Subscriber,Male,1991.0,no,0.803477
1,2018-01-01 8:04:11,2018-01-01 8:19:56,2968,3,Shedd Aquarium,41,Federal St & Polk St,Subscriber,Male,1994.0,yes,0.803993
2,2018-01-01 10:31:32,2018-01-01 10:36:33,688,130,Damen Ave & Division St,213,Leavitt St & North Ave,Subscriber,Male,1987.0,no,0.51593
3,2018-01-01 13:23:23,2018-01-01 13:26:27,3713,80,Aberdeen St & Monroe St,198,Green St & Madison St,Subscriber,Male,1978.0,no,0.361167
4,2018-01-01 14:31:07,2018-01-01 14:38:02,6057,130,Damen Ave & Division St,130,Damen Ave & Division St,Subscriber,Male,1990.0,no,


In [10]:
# Describe it
bikes.describe()

Unnamed: 0,bikeid,from_station_id,to_station_id,birthyear,distance_miles
count,154966.0,154966.0,154966.0,102805.0,146945.0
mean,3463.353355,93.154169,144.825575,1981.877525,1.331433
std,1923.948631,84.20557,128.181158,11.083356,1.168471
min,1.0,2.0,2.0,1900.0,0.075341
25%,1771.0,36.0,49.0,1976.0,0.623241
50%,3584.0,90.0,96.0,1985.0,0.994552
75%,5176.0,133.0,199.0,1990.0,1.647395
max,6471.0,424.0,659.0,2003.0,16.106524


In [11]:
# How many rows?
bikes.shape

(154966, 12)

### Count age outliers
From our `describe()` output we can see that some people entered a birthyear of 1900. The mean birthyear is 1981, which makes sense and suggests that there isn't a huge number of outliers at 1900. We could make a histogram to get an idea of how many there are, but with roughly 150k entries in this dataset, just a few outliers won't really show up on it.
 
Let's instead make a function that takes a column, and then an upper threshold and lower threshold and returns how many values fall outside of that.  

To start, remember that to make a function you use the following main elements
```
def function_name(args):
  action # something you want to do to the data
  action # next thing you want to do
  return(result) # what you want your function to kick back
```
* `def` tells python that you want to define a function.
* `function_name` is where you assign a unique name to your function for when you want to use it later.  Don't use a generic name or something that'll conflict with other existing functions (e.g. don't call it 'mean').
* `args` is where you specify the arguments of the function.
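As a concrete instance of that template, here is a minimal sketch. The function name `count_missing` and the toy data are made up for illustration and aren't part of the bikes dataset.

```python
import pandas as pd

def count_missing(column):          # one argument: a pandas Series
    is_missing = column.isna()      # action: flag each missing value as True
    num_missing = is_missing.sum()  # action: True counts as 1, so the sum is a count
    return(num_missing)             # kick the count back to the caller

# Toy data (hypothetical) with two missing values
toy = pd.Series([1991.0, None, 1987.0, None])
print(count_missing(toy))
```

Each indented line is one step, and only the value named in `return` makes it back out of the function.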

Let's make the function `count_outliers()`.  We want the function to do the following:

* Take a column name, min_value, and max_value as arguments
* Store how many values in the column fall below the min_value level
* Store how many values in the column fall above the max_value level
* Add the two values together
* Return the sum of those

To check how many fall above or below a value, we can just use a comparison operator (<, >, ==, ...). For example, below I ask which values in a column fall below the year 1920.

In [18]:
# How many birthyears are lower than 1920?
bikes['birthyear'] < 1920

0         False
1         False
2         False
3         False
4         False
          ...  
154961    False
154962    False
154963    False
154964    False
154965    False
Name: birthyear, Length: 154966, dtype: bool

Since that returned booleans, we can just apply the `.sum()` method to it to count the `True` values.

In [19]:
# Count them
(bikes['birthyear'] < 1920).sum()

9

Cool, so now let's do that operation in a function we'll define as `count_outliers`.  Keep in mind that we're going to want two arguments, the column name to check and then the min_value to check.

In [20]:
# Make count_outliers
def count_outliers(column, min_val): # define and specify your two arguments
  num_below = (column < min_val).sum() # get the sum of the comparison and store as num_below
  return(num_below)

Let's check to see if that works.  

In [21]:
count_outliers(column = bikes['birthyear'], min_val = 1920)

9

OK, that worked!  Let's now expand it so that it can also take an upper limit.  We might want this option as it may be unrealistic if say 10 year olds were renting bikes.

In [22]:
# expand to include max_val
def count_outliers(column, min_val, max_val): # add in max_value argument
  num_below = (column < min_val).sum() 
  num_above = (column > max_val).sum() # getting the number of outliers above
  total_outliers = num_below + num_above # get total number
  return(total_outliers)

Now let's use that function to see how many riders have birthyears earlier than 1920 or later than 2000. These data were collected in 2018 and you likely needed to be older than 18 to rent, so that's a good upper bound.

In [23]:
count_outliers(column= bikes['birthyear'], min_val = 1920, max_val = 2000)

167

OK, so we have a bunch of younger riders renting them along with those few that were apparently born before 1920.  These may or may not have been real ages, but that takes domain knowledge to know for sure.  The bigger point of this exercise was to show how to build a simple function to do a specific operation.
 
It's worth mentioning that the function we made was not necessarily efficient.  It's instead built to break down the logic step-by-step to demonstrate how you add operations within functions.  But, efficiency aside, if you're new to function building I strongly suggest using this simple step-by-step approach vs. trying a one-line solution.  
 
FWIW if I wanted a single line to do this I would use the following
 
```
def count_outliers(column, min_val, max_val): 
  return(np.where((column < min_val) | (column > max_val), 1, 0).sum())
```


In [24]:
# See!
def count_outliers(column, min_val, max_val): 
  return(np.where((column < min_val) | (column > max_val), 1, 0).sum())
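As a quick sanity check, the sketch below runs the step-by-step and one-line versions on a small made-up Series and confirms they agree. The names `count_outliers_steps`, `count_outliers_oneline`, and `toy_years` are hypothetical, chosen only so both versions can sit side by side. It also illustrates that missing values (`NaN`) are not counted by either version, since any comparison against `NaN` evaluates to False.

```python
import numpy as np
import pandas as pd

def count_outliers_steps(column, min_val, max_val):
    num_below = (column < min_val).sum()   # values under the lower threshold
    num_above = (column > max_val).sum()   # values over the upper threshold
    return(num_below + num_above)

def count_outliers_oneline(column, min_val, max_val):
    return(np.where((column < min_val) | (column > max_val), 1, 0).sum())

# Toy data (hypothetical): one value too low, one too high, one missing
toy_years = pd.Series([1900.0, 1985.0, np.nan, 2003.0, 1990.0])
steps = count_outliers_steps(toy_years, 1920, 2000)
oneline = count_outliers_oneline(toy_years, 1920, 2000)
print(steps, oneline)
```

Both calls count the 1900 and 2003 entries and skip the missing one, so if you also care about missing birthyears you'd need to count those separately.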

## Extract from JSON

So we have our episode info for the TV show Silicon Valley.  Let's write a function that returns a list of each episode name.  

To do this we'll do the following steps:
* Figure out what level the episode name is in the data
* Make a loop that extracts that level from each episode entry
* Make it into a function that takes just the name of the JSON data


In [25]:
# First just run this to import the data
import requests
url = 'http://api.tvmaze.com/singlesearch/shows?q=Silicon Valley&embed=episodes'
sv_json_obj = requests.get(url)
sv_json = sv_json_obj.json()

In [26]:
# You can check the data again
sv_json

{'id': 143,
 'url': 'https://www.tvmaze.com/shows/143/silicon-valley',
 'name': 'Silicon Valley',
 'type': 'Scripted',
 'language': 'English',
 'genres': ['Comedy'],
 'status': 'Ended',
 'runtime': 30,
 'averageRuntime': 30,
 'premiered': '2014-04-06',
 'ended': '2019-12-08',
 'officialSite': 'http://www.hbo.com/silicon-valley/',
 'schedule': {'time': '22:00', 'days': ['Sunday']},
 'rating': {'average': 8.4},
 'weight': 96,
 'network': {'id': 8,
  'name': 'HBO',
  'country': {'name': 'United States',
   'code': 'US',
   'timezone': 'America/New_York'},
  'officialSite': 'https://www.hbo.com/'},
 'webChannel': None,
 'dvdCountry': None,
 'externals': {'tvrage': 33759, 'thetvdb': 277165, 'imdb': 'tt2575988'},
 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/215/538434.jpg',
  'original': 'https://static.tvmaze.com/uploads/images/original_untouched/215/538434.jpg'},
 'summary': '<p>In the high-tech gold rush of modern Silicon Valley, the people most qualified

So we want to get the episode information.  Looking at the structure of our JSON we can see we need to go down into '_embedded' and then to 'episodes'

In [27]:
# This will give us a list.  Each element of the list is a JSON of episode information
sv_json['_embedded']['episodes']

[{'id': 10897,
  'url': 'https://www.tvmaze.com/episodes/10897/silicon-valley-1x01-minimum-viable-product',
  'name': 'Minimum Viable Product',
  'season': 1,
  'number': 1,
  'type': 'regular',
  'airdate': '2014-04-06',
  'airtime': '22:00',
  'airstamp': '2014-04-07T02:00:00+00:00',
  'runtime': 30,
  'rating': {'average': 7.9},
  'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_landscape/49/123633.jpg',
   'original': 'https://static.tvmaze.com/uploads/images/original_untouched/49/123633.jpg'},
  'summary': "<p>Attending an elaborate launch party, Richard and his computer programmer friends - Big Head, Dinesh and Gilfoyle - dream of making it big. Instead, they're living in the communal Hacker Hostel owned by former programmer Erlich, who gets to claim ten percent of anything they invent there. When it becomes clear that Richard has developed a powerful compression algorithm for his website, Pied Piper, he finds himself courted by Gavin Belson, his egomaniacal c

In [28]:
# Given it's a list we can ask for specific ones like we did before
sv_json['_embedded']['episodes'][3]

{'id': 10900,
 'url': 'https://www.tvmaze.com/episodes/10900/silicon-valley-1x04-fiduciary-duties',
 'name': 'Fiduciary Duties',
 'season': 1,
 'number': 4,
 'type': 'regular',
 'airdate': '2014-04-27',
 'airtime': '22:00',
 'airstamp': '2014-04-28T02:00:00+00:00',
 'runtime': 30,
 'rating': {'average': 9},
 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_landscape/49/123636.jpg',
  'original': 'https://static.tvmaze.com/uploads/images/original_untouched/49/123636.jpg'},
 'summary': "<p>At Peter's toga party, Richard drunkenly promises to make Erlich a board member, which he regrets the next morning. After being unassigned at Hooli, Big Head finds others like him who have made careers out of doing nothing. Richard struggles to put Pied Piper's vision into words for a presentation without Erlich; later, he discovers an interesting connection between Peter and Gavin Belson.</p>",
 '_links': {'self': {'href': 'https://api.tvmaze.com/episodes/10900'}}}

In [29]:
# let's get just the episode name
sv_json['_embedded']['episodes'][3]['name']

'Fiduciary Duties'
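If you're unsure which level holds the field you want, checking the keys at each level can help. Below is a minimal sketch on a tiny stand-in dictionary. `mock_show` is made up and only mimics the shape of the real response, using two of the episodes shown above.

```python
# Tiny stand-in shaped like the TVmaze response (hypothetical, heavily trimmed)
mock_show = {
    'name': 'Silicon Valley',
    '_embedded': {
        'episodes': [
            {'name': 'Minimum Viable Product', 'season': 1, 'number': 1},
            {'name': 'Fiduciary Duties', 'season': 1, 'number': 4},
        ]
    },
}

print(list(mock_show.keys()))                          # top-level keys
print(list(mock_show['_embedded'].keys()))             # one level down
print(type(mock_show['_embedded']['episodes']))        # a list, so index by position
print(list(mock_show['_embedded']['episodes'][1].keys()))  # keys of a single episode
print(mock_show['_embedded']['episodes'][1]['name'])   # the field we're after
```

Dictionaries are indexed by key and lists by position, which is why the path mixes strings and a number.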

### Making our loop
Great, so we've figured out how to extract just a single episode name. Now given it's in a list we can iterate through it and extract the name for each and every episode.  

A loop has the following general format
```
for i in length_to_run:
  store = action_on_i'th_element
  action_with_stored_data
```
* i is the abstracted variable.  So as we used the number 3 to get the 4th episode name when we called `sv_json['_embedded']['episodes'][3]['name']`, we could instead abstract that number to i where it then goes through each element in a given range
* The length to run is how many i values you want to iterate through.  This is normally the length of the data, though I usually start simple and test on a short range first
* The 'store' part is doing an action on each element in your list
* The last part is what you want to do with each stored element before going on to the next.  Frequently this will be appending to a new list or adding to a data frame
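To make that template concrete before applying it to the episode data, here is a minimal sketch on a made-up list. The names `titles` and `shouted` are hypothetical and used only for this illustration.

```python
titles = ['the cap table', 'server space', 'bad money']  # toy list to loop over

shouted = []                        # empty list to collect results
for i in range(len(titles)):        # i takes the values 0, 1, 2
    upper_title = titles[i].upper() # 'store': an action on the i'th element
    shouted.append(upper_title)     # what we do with the stored element

print(shouted)
```

The same three pieces (a range to run over, an action on the i'th element, and something done with the result) show up in every loop in this section.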

Let's start with a simple loop that goes through the first five elements of our list of JSONs and prints the episode name.

In [30]:
# Simple loop
for i in range(5):
  print(sv_json['_embedded']['episodes'][i]['name'])

Minimum Viable Product
The Cap Table
Articles of Incorporation
Fiduciary Duties
Signaling Risk


So here we are just printing out the i'th episode name from sv_json.

I used `range(5)` which is an easy way to tell it how long to run for.  It essentially says run i from 0 up to, but not including, 5 (so 0, 1, 2, 3, 4).
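If you want to see exactly which values a range produces, you can wrap it in `list()`. A quick sketch (the variable names are just for illustration):

```python
first_five = list(range(5))     # starts at 0, stops before 5
middle = list(range(2, 5))      # an optional start value: 2, 3, 4

print(first_five)  # [0, 1, 2, 3, 4]
print(middle)      # [2, 3, 4]
```

Because the stop value is excluded, `range(len(some_list))` lines up exactly with the valid positions of `some_list`.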

Now that the simple loop is working, let's instead store each name to a list as we know we're going to want that in the end.

To do this we make an empty list *outside of our loop*. Instead of printing the name we store it to an object *inside of the loop*. We then append each stored object to the end of this empty list *inside of the loop*.  

In [31]:
# Adding names to a list
episodes = []
for i in range(5):
  ep_name = sv_json['_embedded']['episodes'][i]['name']
  episodes.append(ep_name)

In [32]:
# Check
episodes

['Minimum Viable Product',
 'The Cap Table',
 'Articles of Incorporation',
 'Fiduciary Duties',
 'Signaling Risk']

Great, so the last way we want to beef up our loop is to make it run the whole length of the list of episode JSONs.  

The easy way to do this is to ask for the length of the list using `len()` and then get the range of that value.  This way if the length changes it'll still work.

In [33]:
# How many episodes?
len(sv_json['_embedded']['episodes'])

53

In [34]:
# Final loop
episodes = [] # make empty list
num_episodes = len(sv_json['_embedded']['episodes']) # store length of list (could be inside range too)
for i in range(num_episodes):
  ep_name = sv_json['_embedded']['episodes'][i]['name']
  episodes.append(ep_name)

In [35]:
# Yay
episodes

['Minimum Viable Product',
 'The Cap Table',
 'Articles of Incorporation',
 'Fiduciary Duties',
 'Signaling Risk',
 'Third Party Insourcing',
 'Proof of Concept',
 'Optimal Tip-to-Tip Efficiency',
 'Sand Hill Shuffle',
 'Runaway Devaluation',
 'Bad Money',
 'The Lady',
 'Server Space',
 'Homicide',
 'Adult Content',
 'White Hat/Black Hat',
 'Binding Arbitration',
 'Two Days of the Condor',
 'Founder Friendly',
 'Two in the Box',
 "Meinertzhagen's Haversack",
 'Maleant Data Systems Solutions',
 'The Empty Chair',
 'Bachmanity Insanity',
 'To Build a Better Beta',
 "Bachman's Earning's Over-ride",
 'Daily Active Users',
 'The Uptick',
 'Success Failure',
 'Terms of Service',
 'Intellectual Property',
 'Teambuilding Exercise',
 'The Blood Boy',
 'Customer Service',
 'The Patent Troll',
 'The Keenan Vortex',
 'Hooli-Con',
 'Server Error',
 'Grow Fast or Die Slow',
 'Reorientation',
 'Chief Operating Officer',
 'Tech Evangelist',
 'Facial Recognition',
 'Artificial Emotional Intelligence',


### Turning our loop into a function

So our loop works.  Now let's turn it into a function where we provide one argument - the name of the data.  

In [36]:
def get_eps_names(eps_data): # We'll have the only argument be 'eps_data' which is the object name of our episode data
  episodes = [] 
  num_episodes = len(eps_data['_embedded']['episodes']) # note it's now calling eps_data
  for i in range(num_episodes):
    ep_name = eps_data['_embedded']['episodes'][i]['name'] # calling eps_data
    episodes.append(ep_name)
  return(episodes)
 


In [37]:
# Test it out
get_eps_names(eps_data = sv_json)

['Minimum Viable Product',
 'The Cap Table',
 'Articles of Incorporation',
 'Fiduciary Duties',
 'Signaling Risk',
 'Third Party Insourcing',
 'Proof of Concept',
 'Optimal Tip-to-Tip Efficiency',
 'Sand Hill Shuffle',
 'Runaway Devaluation',
 'Bad Money',
 'The Lady',
 'Server Space',
 'Homicide',
 'Adult Content',
 'White Hat/Black Hat',
 'Binding Arbitration',
 'Two Days of the Condor',
 'Founder Friendly',
 'Two in the Box',
 "Meinertzhagen's Haversack",
 'Maleant Data Systems Solutions',
 'The Empty Chair',
 'Bachmanity Insanity',
 'To Build a Better Beta',
 "Bachman's Earning's Over-ride",
 'Daily Active Users',
 'The Uptick',
 'Success Failure',
 'Terms of Service',
 'Intellectual Property',
 'Teambuilding Exercise',
 'The Blood Boy',
 'Customer Service',
 'The Patent Troll',
 'The Keenan Vortex',
 'Hooli-Con',
 'Server Error',
 'Grow Fast or Die Slow',
 'Reorientation',
 'Chief Operating Officer',
 'Tech Evangelist',
 'Facial Recognition',
 'Artificial Emotional Intelligence',


Cool, works!  You could imagine making the function more complicated to instead allow you to specify other elements of the JSON you want.  But for now I hope this highlights how to make a function that then iterates through a list using a loop and returns the desired extracted data.  
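As one possible way to generalize, the sketch below adds a second argument for the field to extract. The function name `get_eps_field` and the stand-in `mock_show` dictionary are hypothetical. The mock only mimics the shape of the real data, using two episodes shown earlier. This version also loops over the episode dictionaries directly instead of over `range(len(...))`, which is a common alternative when you don't need the position itself.

```python
def get_eps_field(eps_data, field):
    values = []
    for ep in eps_data['_embedded']['episodes']:  # ep is one episode dictionary
        values.append(ep[field])                  # pull out the requested field
    return(values)

# Stand-in data shaped like the TVmaze response (hypothetical, heavily trimmed)
mock_show = {
    '_embedded': {
        'episodes': [
            {'name': 'Minimum Viable Product', 'number': 1, 'airdate': '2014-04-06'},
            {'name': 'Fiduciary Duties', 'number': 4, 'airdate': '2014-04-27'},
        ]
    }
}

print(get_eps_field(mock_show, 'name'))
print(get_eps_field(mock_show, 'airdate'))
```

With the real data you would call it the same way, e.g. `get_eps_field(sv_json, 'airdate')`, assuming every episode entry has that key.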


## Try it on your own

Go and make a couple new functions.  Here are two to make:

* Make one that converts the timestamp data in the bikes dataset to datetimes, calculates the ride time, and then counts the number of rides over 120 minutes.

* Make one that extracts the episode number, season, title, and airdate.  Instead of making it go to a list, you can have it return a pandas dataframe.

