# Lesson 6: Functions & Loops

Today:
1. Programming Elements: Functions
    + Why define our own functions
    + How to define your own functions in python
    + Application to classification
2. Programming Elements: Loops
    + Understanding the `for` loop
    + Tracing how variables change values during loops
    + Accessing entries of a data frame using loops
3. Application to classification

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

## Class Starter

Suppose that we have the following python commands

    total_purchase = 49
    x = total_purchase >= 75
    ships_to_US_address = True

    is_shipping_free = ships_to_US_address or x

Which of the following is a correct statement about 
- the value stored in the name `is_shipping_free`, and 
- what `x` represents?

  
<p>A. is_shipping_free is True; x checks if your total purchase is 75 or greater</p>
<p>B. is_shipping_free is True; x checks if shipping is free</p>
<p>C. is_shipping_free is False; x checks if your total purchase is 75 or greater</p>
<p>D. is_shipping_free is False; x checks if shipping is free</p>
<p>E. Statements A-D are all false</p>

Respond on PollEV: https://pollev.com/fshum

Answer: 

## 1. Functions

### Functions in Python

We have used many functions in python. For example: 
- print()
- sum([1, 3]) returns 4
- np.mean([1,3]) returns 2
- etc.

There are a lot of useful functions in python that we can use. However, sometimes we have a specific task for which we need to write our own python function.

### Why define our own functions

**Example**

You decide to check out a very popular East Village ramen restaurant for dinner on Friday night.  After waiting in line for two hours, you are finally seated.  As you are reading the menu, you realize that this restaurant is cash-only.  You have $\$$28.75 with you and need to make sure that you have enough cash to pay for the dinner, including the 8.875\% tax and the tip.

You are considering ordering a $\$$15 dish and a $\$$6 beverage.  How much would you have to pay if you are giving an 18% tip?

In [None]:
subtotal = 15 + 6
tip = subtotal * 0.18
tax = subtotal * .08875
total = subtotal + tip + tax

total

**Example, continued**

With $\$$28.75 in your pocket, and knowing that the first order would leave you with some cash left over, you wonder if you could afford a $\$$17 dish and a $\$$6 beverage, with the same 8.875\% tax and 18\% tip.

In [None]:
subtotal = 17 + 6
tip = subtotal * 0.18
tax = subtotal * .08875
total = subtotal + tip + tax

total

**Example, continued**

Since you only have $\$$28.75 but really want that $\$$17 dish and the $\$$6 beverage, you wonder if you can afford this meal if you only give a 15\% tip. 

In [None]:
subtotal = 17 + 6
tip = subtotal * 0.15
tax = subtotal * .08875
total = subtotal + tip + tax

total

**Takeaways**:
+ The above examples are all similar (and **repetitive**!)
+ Same method of computation, just different numbers
+ Wouldn't it be nice if there were a python function that let us repeat this computation easily?  Something like

        calculate_bill( LISTOFPRICESOFITEMS, TIPPERCENTAGE )

  that calculates the total bill, given a list of item prices and the tip percentage we want to give.

### 1.2 How to define your own functions in python

    def MYFUNCTION( INPUT1 , INPUT2 , ...):
        ...
        ...
        return OUTPUTVALUE

Here:
- `MYFUNCTION` is the name of your new function (you choose the name)
- `INPUT1`, `INPUT2` etc. are the input(s) to the function (you can choose the names)
- `OUTPUTVALUE` is the name of the variable whose value is returned by the function (you can choose the name).

**Example**

In [None]:
def calculate_bill(list_of_prices, tip_percentage):
    subtotal = sum(list_of_prices)
    tip = subtotal * tip_percentage
    tax = subtotal * .08875
    total = subtotal + tip + tax
    return total
 

In [None]:
prices = [15, 6]
tip = 0.18
calculate_bill(prices, tip)

In [None]:
prices = [17, 6]
tip = 0.18
calculate_bill(prices, tip)

In [None]:
prices = [17, 6]
tip = 0.15
calculate_bill(prices, tip)

#### Concept Check

Suppose that I came up with a new function called `myfunction()`, defined as follows


        def myfunction ( x, y ):
            z = x**2 + y
            return( z )
    
If we run the command `myfunction(2, 3)`, what value would be returned by this function?

A. 4

B. 5

C. 6

D. 7

E. None of the above

Respond on PollEV: https://pollev.com/fshum

### Activity 1

Define a new function called `my_function2`, which
+ takes three numbers as inputs: `a`, `b`, `c`, and
+ if `a` is strictly greater than zero, then it returns as an output `b + c`;
+ otherwise, if `a` is zero or negative, then it returns as an output `b - c`.
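
One possible way to write `my_function2` is sketched below. It is only a minimal sketch using an `if`/`else` statement; the variable name `output` is our own choice, and other correct solutions exist.

```python
def my_function2(a, b, c):
    # if a is strictly greater than zero, the output is b + c
    if a > 0:
        output = b + c
    # otherwise (a is zero or negative), the output is b - c
    else:
        output = b - c
    return output

print(my_function2(3, 1, 3))    # a > 0, so this computes 1 + 3
print(my_function2(0, 4, 8))    # a is zero, so this computes 4 - 8
```

Note that the `return` line is indented so that it sits inside the function body.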

### Activity 2

Example:

If we invest 1000 dollars at a 2% annual interest rate, compounded once a year, then after $n$ years the account holds $1000 \times (1 + 0.02)^n$ dollars.

Define a new function called `compound_interest()` which takes three inputs
+ `initial_deposit`: the amount you deposited in the savings account
+ `interest_rate`: the annual interest rate of the account (in decimal)
+ `num_years`: the number of years the initial deposit stays in the account

and outputs/returns `account_total`, the total amount in the savings account after the specified number of years.

Then, check by using the function to compute the total amount in the account if:
+ we initially deposited 1000 dollars, the annual interest rate is 0.02, and we keep the account for 2 years 
+ we initially deposited 1250 dollars, the annual interest rate is 0.03, and we keep the account for 30 years 
+ try other inputs!
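
A minimal sketch of `compound_interest()` follows, assuming interest is compounded once per year as in the formula above. Treat it as one possible solution rather than the only one.

```python
def compound_interest(initial_deposit, interest_rate, num_years):
    # each year the balance is multiplied by (1 + interest_rate),
    # so after num_years the factor is (1 + interest_rate) ** num_years
    account_total = initial_deposit * (1 + interest_rate) ** num_years
    return account_total

# the two checks suggested in the activity
print(compound_interest(1000, 0.02, 2))
print(compound_interest(1250, 0.03, 30))
```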

## 2. Loops

“Loops” are used when we want to repeat the same task for each member of a list.

### 2.1. Understanding `for` loops

To repeat TASK for each VALUE in the list LIST:

    for VALUE in LIST:
        TASK

**Example**

Suppose we want to display the text:  

"1 squared is 1"

"2 squared is 4"

... up to

"20 squared is 400"

In [None]:
numlist = np.arange(1, 21)

In [None]:
numlist

In [None]:
for x in numlist:
    print(x, 'squared is', x ** 2)   # print() already puts a space between its inputs

### Activity

Suppose we want to display the text:  

"1 cubed is 1"

"2 cubed is 8"

"3 cubed is 27"

... up to

"20 cubed is 8000"

**Write a for loop that accomplishes this task.**

### Activity

Suppose we want to display the text:  

"1 cubed is 1"

"3 cubed is 27"

"5 cubed is 125"

... up to

"25 cubed is 15625"

**Write a for loop that accomplishes this task.**
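
One way to approach this activity is sketched below. It assumes we build the list of odd numbers with the third (step) input of `np.arange`; the name `oddlist` is our own.

```python
import numpy as np

# np.arange(start, stop, step): the stop value 26 is excluded,
# so the entries are 1, 3, 5, ..., 25
oddlist = np.arange(1, 26, 2)

for x in oddlist:
    print(x, 'cubed is', x ** 3)
```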

### 2.2 Tracing how variables change values during loops

**Example: Trace what's going on with the following for-loop.**

In [None]:
mylist = [-3, 5, 0, 7, 10]
y = 1
for x in mylist:
    y = x+y
    z = y ** 2
    print(z)

| x | y | z |
| --- | --- | --- |
| | 1 | |
| -3 | -2 | 4 |
| 5 | 3 | 9 |
| 0 | 3 | 9 |
| 7 | 10 | 100 |
| 10 | 20 | 400 |
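
The table above can be checked by printing all three variables on every pass through the loop. In this sketch, `trace` is a hypothetical helper list (not part of the original example) that records the values after each pass.

```python
mylist = [-3, 5, 0, 7, 10]
y = 1
trace = []   # hypothetical helper: stores (x, y, z) after each pass

for x in mylist:
    y = x + y          # y accumulates the entries of mylist
    z = y ** 2         # z is recomputed from the updated y
    trace.append((x, y, z))
    print('x =', x, ' y =', y, ' z =', z)
```

Each printed line corresponds to one row of the table, starting from the second row.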

**Concept Check 1:**

What will the following for loop do?

    a=0
    for i in [1, 2, 3, 4]:
        a = a + i
    print(a)

A. It will print out the values: 1, 2, 3, 4

B. It will print out the values: 0, 1, 3, 6, 10

C. It will print out the values: 1, 3, 6, 10

D. It will print out the value: 4

E. It will print out the value: 10

F. None of the above

**Concept Check 2:**

What will the following for loop do?

    a=0
    for i in [1, 2, 3, 4]:
        a = a + i
        print(a)

A. It will print out the values: 1, 2, 3, 4

B. It will print out the values: 0, 1, 3, 6, 10

C. It will print out the values: 1, 3, 6, 10

D. It will print out the value: 4

E. It will print out the value: 10

F. None of the above

**Example**

Recall our `calculate_bill()` function from above, reproduced below.

In [None]:
# copy and paste function below
def calculate_bill(list_of_prices, tip_percentage):
    subtotal = sum(list_of_prices)
    tip = subtotal * tip_percentage
    tax = subtotal * .08875
    total = subtotal + tip + tax
    return total


Suppose that we would like to compute the possible bills for several tip percentages, 10\%, 11\%, 12\%, ..., 25\%, if we order the following items:
+ a \$7 appetizer
+ a \$15 entree
+ a \$17 entree
+ two \$6 beverages.
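
A minimal sketch of this computation with a `for` loop is shown below. It assumes the tip percentages are whole numbers from 10 to 25; the names `tip_range` and `totals` are our own choices.

```python
import numpy as np

def calculate_bill(list_of_prices, tip_percentage):
    subtotal = sum(list_of_prices)
    tip = subtotal * tip_percentage
    tax = subtotal * .08875
    return subtotal + tip + tax

prices = [7, 15, 17, 6, 6]       # appetizer, two entrees, two beverages
tip_range = np.arange(10, 26)    # 10, 11, ..., 25 (26 is excluded)

totals = []   # hypothetical list that keeps the results for later use
for t in tip_range:
    total = calculate_bill(prices, t * 0.01)   # convert percent to decimal
    totals.append(total)
    print(t, 'percent tip: total is', round(total, 2))
```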

### 2.3. Accessing entries of a data frame during loops

Suppose that we would like to store the values that we computed during loops into a table.

**Example**

Create a data frame called `squares` which has 20 rows and 2 columns:

<table>
    <tr>
        <th>n</th>
        <th>n_squared</th>
    </tr>    
    <tr>
        <td>1</td>
        <td>1</td>
    </tr>    
    <tr>
        <td>2</td>
        <td>4</td>
    </tr>    
    <tr>
        <td>3</td>
        <td>9</td>
    </tr>    
    <tr>
        <td>...</td>
        <td>...</td>
    </tr>    
    <tr>
        <td>19</td>
        <td>361</td>
    </tr>
    <tr>
        <td>20</td>
        <td>400</td>
    </tr>
</table>

In [None]:
squares = pd.DataFrame(np.empty((20, 2)), columns = ['n', 'n_squared'])
squares
                    

In [None]:
squares = pd.DataFrame(index=range(20), columns = ['n', 'n_squared'])
squares

In [None]:
# fill in the empty data frame row by row
for row in range(20):
    n = row + 1
    squares.loc[row, 'n'] = n
    squares.loc[row, 'n_squared'] = n **2

squares


In [None]:
# note: the above can be done without a for loop
#  using a method we learned earlier in the semester

squares_2 = pd.DataFrame(index=range(20), columns = ['n', 'n_squared'])
squares_2['n'] = np.arange(1, 21)
squares_2['n_squared'] = squares_2['n'] ** 2
squares_2




# so why did we use a for loop?  
#  There are similar tasks that cannot be done using this more straightforward method
#   See the next example

**Activity**

Earlier, we defined a new function called `my_function2`, which
+ takes three numbers as inputs: `a`, `b`, `c`, and
+ if `a` is strictly greater than zero, then it returns as an output `b + c`;
+ otherwise, if `a` is zero or negative, then it returns as an output `b - c`.

We want to record the outputs of `my_function2` for various values of a, b, and c in the data frame called `records` below:

<table>
    <tr>
        <th>a</th>
        <th>b</th>
        <th>c</th>
        <th>output</th>
    </tr>
    <tr>
        <td>3</td>
        <td>1</td>
        <td>3</td>
        <td> </td>
    </tr>
    <tr>
        <td>-2</td>
        <td>10</td>
        <td>3</td>
        <td> </td>
    </tr>
    <tr>
        <td>1</td>
        <td>4</td>
        <td>9</td>
        <td> </td>
    </tr>
    <tr>
        <td>0</td>
        <td>4</td>
        <td>8</td>
        <td> </td>
    </tr>
</table>


In [None]:
# do not modify this cell

records = pd.DataFrame( {'a': [3, -2, 1, 0],
                         'b': [1, 10, 4, 4],
                         'c': [3, 3, 9, 8],
                         'output': [0, 0, 0, 0]} )
records

In [None]:
# copy and paste function below


In [None]:
# filling in the output column in the records data frame "by hand" / row by row
## this is fine because we have only four rows!
## we probably don't want to do this if we have hundreds of rows

for row in range(4):
    records.iloc[row, 3] = my_function2(records.iloc[row, 0], records.iloc[row, 1], records.iloc[row, 2])

records

In [None]:
# do not modify this cell

records = pd.DataFrame( {'a': [3, -2, 1, 0],
                         'b': [1, 10, 4, 4],
                         'c': [3, 3, 9, 8],
                         'output': [0, 0, 0, 0]} )
records

In [None]:
records['output'] = my_function2(records['a'], records['b'], records['c'])
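
The last cell passes whole columns to `my_function2`. If the function is written with an `if` statement, as assumed in the sketch below, that call raises a `ValueError`, because `a > 0` is then a whole column of True/False values rather than a single one. The loop avoids this by handing the function one number at a time. The definition of `my_function2` here is our own assumption based on the description above.

```python
import pandas as pd

def my_function2(a, b, c):
    # assumed definition, following the description of the function
    if a > 0:
        return b + c
    else:
        return b - c

records = pd.DataFrame({'a': [3, -2, 1, 0],
                        'b': [1, 10, 4, 4],
                        'c': [3, 3, 9, 8],
                        'output': [0, 0, 0, 0]})

# passing whole columns fails: `if` needs a single True or False
try:
    records['output'] = my_function2(records['a'], records['b'], records['c'])
    column_call_worked = True
except ValueError as err:
    column_call_worked = False
    print('ValueError:', err)

# the loop passes one number at a time, so the `if` is well defined
for row in range(len(records)):
    records.iloc[row, 3] = my_function2(records.iloc[row, 0],
                                        records.iloc[row, 1],
                                        records.iloc[row, 2])

print(records)
```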

**Example**

Suppose that we would like to compute the possible bills for several tip percentages, 10\%, 11\%, 12\%, ..., 25\%, if we order the following items:
+ a \$7 appetizer
+ a \$15 entree
+ a \$17 entree
+ two \$6 beverages.

We would like to create a data frame with 2 columns and one row for each possible tip percentage.  The first column is the tip percentage itself and the second column is the total bill:

<table>
    <tr>
        <th>tip_percentage</th>
        <th>total</th>
    </tr>    
    <tr>
        <td>10</td>
        <td>60.62625</td>
    </tr>    
    <tr>
        <td>11</td>
        <td>61.13625</td>
    </tr>    
    <tr>
        <td>12</td>
        <td>61.64625</td>
    </tr>    
    <tr>
        <td>...</td>
        <td>...</td>
    </tr>
    <tr>
        <td>24</td>
        <td>67.76625</td>
    </tr>
    <tr>
        <td>25</td>
        <td>68.27625</td>
    </tr>
</table>



In [None]:
def calculate_bill( list_of_prices, tip_percentage ):
    # compute based on inputs
    subtotal = sum(list_of_prices)
    total = subtotal * (1 + tip_percentage + 0.08875 )
    
    return total 

In [None]:
bill_tip = pd.DataFrame( index = range(16), columns = ['tip_percentage', 'total']  )
tip_range = np.arange(10, 26)
prices = [7, 15, 17, 6, 6]
for row in range(16):
    t = tip_range[row]
    bill_tip.iat[row, 0] = t
    tip = t *.01
    bill_tip.iat[row, 1] = calculate_bill(prices, tip)
bill_tip

**Exercise**

We defined a new function called `compound_interest()` which takes three inputs
+ `initial_deposit`: the amount you deposited in the savings account
+ `interest_rate`: the annual interest rate of the account (in decimal)
+ `num_years`: the number of years the initial deposit stays in the account

and outputs/returns the total amount in the savings account after the specified number of years.  (The function definition is included in the code cell below.)

Suppose that you invested 1000 dollars in a savings account with a 2% annual interest rate, compounded annually.

Create a data frame called `account` with two columns, `num_years` and `amount`; the `amount` column stores the amount in the account after the given number of years, from year 0 to year 30.

<table>
    <tr>
        <th>num_years</th>
        <th>amount</th>
    </tr>    
    <tr>
        <td>0</td>
        <td>1000</td>
    </tr>    
    <tr>
        <td>1</td>
        <td>1020</td>
    </tr>    
    <tr>
        <td>2</td>
        <td>1040.4</td>
    </tr>    
    <tr>
        <td>...</td>
        <td>...</td>
    </tr>
    <tr>
        <td>29</td>
        <td>1775.84469029741 </td>
    </tr>
    <tr>
        <td>30</td>
        <td>1811.36158410335</td>
    </tr>
</table>
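
One possible approach is sketched below, following the same pattern as the `bill_tip` example. The definition of `compound_interest()` is an assumption (annual compounding), included so that the sketch runs on its own.

```python
import pandas as pd

def compound_interest(initial_deposit, interest_rate, num_years):
    # assumed definition: the balance grows by a factor of
    # (1 + interest_rate) once per year
    return initial_deposit * (1 + interest_rate) ** num_years

# years 0 through 30 give 31 rows
account = pd.DataFrame(index=range(31), columns=['num_years', 'amount'])

for row in range(31):
    account.iat[row, 0] = row                                  # number of years
    account.iat[row, 1] = compound_interest(1000, 0.02, row)   # amount after that many years

print(account.head())
print(account.tail())
```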

## 3. Application to Classification

Consider the second simple classifier that we built last class.


**Example: Encoding a simple classifier (version 2)**

<table>
    <tr>
        <td><img src="images/lec20-knn-illustration2_wline2.jpg" width="600"></td>
        <td><img src="images/dec_tree1b.jpg" width="600"></td>
    </tr>
</table>  

It would be nice if we could write a function that:
- takes as inputs: marginal adhesion and clump thickness values
- outputs: a prediction of 0 or 1 based on the decision tree we constructed

For example:

    ma = 3 
    ct = 5
    
    predicted_class = predict_tumor_class( ma, ct )

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
cancerdata = pd.read_csv('../../../shared/datasets/cancer.csv')

In [None]:
# as we did in lesson 5, split into training and test datasets:
#      split the cancer dataset into two: training data and test data 
X = cancerdata.iloc[ :  ,  1:10  ]
Y = cancerdata['Class']

from sklearn.model_selection import train_test_split

X_train, X_test,  Y_train, Y_test = train_test_split( X, Y, test_size = 0.3 , random_state = 1110 )

# explore the training dataset

# combine X_train and Y_train into one dataset 
#  concatenate the Y_train dataset as another column on X_train
training_data = pd.concat( [ X_train, Y_train ], axis = 1  )

training_data.head()

In [None]:
test_data = pd.concat( [ X_test, Y_test ], axis = 1  )

In [None]:
test_data.head()

In [None]:
test_data.shape

In [None]:
# Note the index is not in consecutive order from 0
test_data.index = range(205)

In [None]:
test_data.head()

In [None]:
# this is a simple classifier that we constructed in the notebook for Lesson05_Pt2

# the decision tree:
# If marginal_adhesion is less than 4 AND clump_thickness is less than 7, the tumor is classified as 0 (benign); 
# else, it is classified as 1 (malignant)


# example of a new data point
marginal_adhesion = 2
clump_thickness = 1

if (marginal_adhesion < 4 and clump_thickness < 7):
    class_predicted = 0
else:
    class_predicted = 1

class_predicted

**We will "wrap" our classifier as a task done by a new function** which we will name `predict_tumor_class()`

+ inputs: two numbers: `marginal_adhesion` and `clump_thickness`
+ output: one number: 0 if we predict the tumor to be benign, 1 otherwise

`Z = predict_tumor_class( X, Y )`

where
+ X = marginal adhesion value
+ Y = clump thickness value
+ Z = the prediction that your decision tree classifier makes for the given values of X and Y

In [None]:
def predict_tumor_class( x , y ):
    # x = marginal_adhesion
    # y = clump thickness
    
    if (x < 4 and y < 7):
        class_predicted = 0
    else:
        class_predicted = 1
    
    return( class_predicted )

In [None]:
predict_tumor_class(2, 1)

In [None]:
predict_tumor_class( X_test.iloc[0].at[ 'Marginal Adhesion'], X_test.iloc[0].at[ 'Clump Thickness'] ) # row 1

In [None]:
# predict the class of the first four rows of the test dataset,

predict_tumor_class( X_test.iloc[0].at[ 'Marginal Adhesion'], X_test.iloc[0].at[ 'Clump Thickness'] ) # row 1

predict_tumor_class( X_test.iloc[1].at[ 'Marginal Adhesion'], X_test.iloc[1].at[ 'Clump Thickness'] ) # row 2

predict_tumor_class( X_test.iloc[2].at[ 'Marginal Adhesion'], X_test.iloc[2].at[ 'Clump Thickness'] ) # row 3

predict_tumor_class( X_test.iloc[3].at[ 'Marginal Adhesion'], X_test.iloc[3].at[ 'Clump Thickness'] ) # row 4

# etc, but there are 205 rows, so we don't want to do this by hand.


# note that we are REPEATING the same command, but with different row numbers
#  that is, each line is of the form

# predict_tumor_class( X_test.iloc[row].at[ 'Marginal Adhesion'], X_test.iloc[row].at[ 'Clump Thickness'] )

# also note that the predictions are not being saved anywhere





In [None]:
# create an empty data frame, 1 column, 205 rows

predictions =pd.DataFrame(index=range(205), columns=['class_predicted'])

# store the predictions in this data frame
for row in range(205):
    output = predict_tumor_class( X_test.iloc[row].at[ 'Marginal Adhesion'], X_test.iloc[row].at[ 'Clump Thickness'] )
    predictions.iat[row, 0] = output

# Next, check how good our predictions are, by comparing to the actual class

# add a second column containing the actual class into the predictions data frame
# the method below requires the indexes to match, had to rename the index above.
predictions['class_actual'] = test_data['Class']
### Alternatives to adding a new column
# create both columns at the start,
# then use a for loop to enter the entries one by one:
# for row in range(205):
#    predictions.iloc[row, 0] = predict_tumor_class( X_test.iloc[row].at['Marginal Adhesion'], X_test.iloc[row].at['Clump Thickness'] )
#    predictions.iloc[row, 1] = Y_test.iloc[row]
# or use the insert function: predictions.insert(1, 'class_actual', pd.Series(np.empty(205), index=range(205)))
# then use the for loop

# count how many predictions are incorrect and how many are correct

# add a new column called "error"
predictions['error'] = abs( predictions['class_actual'] - predictions['class_predicted'] )


predictions.head()

num_incorrect = sum(predictions['error'])
num_correct = 205 - num_incorrect

print(num_incorrect)

percent_incorrect = num_incorrect/205
accuracy = num_correct/205  # percent correct

print(accuracy)

predictions['class_actual']
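
The same bookkeeping can be written without hard-coding the number of rows. The sketch below uses a small made-up data frame, `toy_test`, whose values are for illustration only and do not come from `cancer.csv`; the names `toy_test` and `num_rows` are our own.

```python
import pandas as pd

def predict_tumor_class(x, y):
    # x = marginal adhesion, y = clump thickness (same rule as above)
    if x < 4 and y < 7:
        return 0
    else:
        return 1

# made-up rows for illustration only; NOT taken from cancer.csv
toy_test = pd.DataFrame({'Marginal Adhesion': [1, 5, 2, 3, 8],
                         'Clump Thickness':   [2, 3, 9, 4, 10],
                         'Class':             [0, 1, 1, 1, 1]})

num_rows = len(toy_test)   # used in place of a hard-coded row count
predictions = pd.DataFrame(index=range(num_rows), columns=['class_predicted'])

# store one prediction per row
for row in range(num_rows):
    predictions.iat[row, 0] = predict_tumor_class(toy_test.iloc[row].at['Marginal Adhesion'],
                                                  toy_test.iloc[row].at['Clump Thickness'])

# compare with the actual class: error is 1 for a wrong prediction, 0 otherwise
predictions['class_actual'] = toy_test['Class']
predictions['error'] = abs(predictions['class_actual'] - predictions['class_predicted'])

num_incorrect = sum(predictions['error'])
accuracy = (num_rows - num_incorrect) / num_rows   # fraction of correct predictions
print(num_incorrect, accuracy)
```

Using `len()` means the loop and the accuracy calculation still work if the size of the test dataset changes.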

### Scikit-Learn/Python Decision Tree Classification

Scikit-Learn has a built-in decision tree classifier. The mathematics behind it is a bit complicated and would take several lectures to explain. To learn more, here is a brief lesson about the Decision Tree Classifier: https://www.datacamp.com/tutorial/decision-tree-classification-python

Below is a brief overview on how to use it.

To import the Decision Tree Classifier:

    from sklearn.tree import DecisionTreeClassifier as DTC

To visualize your tree:

    from sklearn.tree import export_graphviz
    from six import StringIO
    from IPython.display import Image
    from pydotplus import graph_from_dot_data

Accuracy metrics:

    from sklearn.metrics import accuracy_score

To create and train:

    # Create a decision tree object
    clf2 = DTC()

    # Train Decision Tree Classifier
    clf2 = clf2.fit(X_train[['Uniformity of Cell Size', 'Clump Thickness']],Y_train)

To predict and compute its accuracy:

    # Predict the response for test dataset
    y_pred2 = clf2.predict(X_test[['Uniformity of Cell Size', 'Clump Thickness']])

    print("Accuracy:",accuracy_score(Y_test, y_pred2))

To visualize the decision tree levels:

    dot_data = StringIO()
    export_graphviz(clf2, \
    out_file=dot_data, filled=True, rounded=True, \
    special_characters=True,feature_names = ['Uniformity of Cell Size', 'Clump Thickness'],class_names=['0','1'])
    graph = graph_from_dot_data(dot_data.getvalue())
    graph.write_png('c2.png')
    Image(graph.create_png())

In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC
# Create a decision tree object
clf2 = DTC()

# Train Decision Tree Classifier
clf2 = clf2.fit(X_train[['Marginal Adhesion', 'Clump Thickness']],Y_train)

In [None]:
from sklearn.metrics import accuracy_score
# Predict the response for test dataset
y_pred2 = clf2.predict(X_test[['Marginal Adhesion', 'Clump Thickness']])

print("Accuracy:",accuracy_score(Y_test, y_pred2))

In [None]:
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
from pydotplus import graph_from_dot_data

dot_data = StringIO()
export_graphviz(clf2, \
out_file=dot_data, filled=True, rounded=True, \
special_characters=True,feature_names = ['Marginal Adhesion', 'Clump Thickness'],class_names=['0','1'])
graph = graph_from_dot_data(dot_data.getvalue())
graph.write_png('c2.png')
Image(graph.create_png())

#### Miscellaneous Jupyter Notebook Tips

To increase the indentation of an entire block of code: highlight the code, then
+ Ctrl + ]

To decrease indentation:
+ Ctrl + [

To comment/uncomment an entire block of code:
+ Ctrl + /