
In [3]:
import pandas as pd

## Use `.apply` to send a column of every row to a function

You can use `.apply` to send a single column to a function. This is useful when cleaning up data, for example when converting formats or altering values.

In [4]:
# What does our data look like?
df = pd.read_csv("https://raw.githubusercontent.com/jeremyallenjacobson/RealRootsReproduction/master/train-1-10-2.csv", header=None)
df.head(10)

Unnamed: 0,0,1,2,3
0,8,6,10,0
1,4,9,8,0
2,6,2,5,0
3,3,6,6,0
4,5,10,8,0
5,10,2,6,0
6,8,8,9,0
7,5,9,1,1
8,9,10,5,0
9,5,3,5,0


In [5]:
df.columns = ['a','b','c','Real root']
df

Unnamed: 0,a,b,c,Real root
0,8,6,10,0
1,4,9,8,0
2,6,2,5,0
3,3,6,6,0
4,5,10,8,0
...,...,...,...,...
695,5,3,1,0
696,4,8,3,1
697,5,5,8,0
698,3,1,10,0


Below we define the function `discriminant`, which takes as input a list `x=[a,b,c]` of coefficients of a polynomial
$$
ax^2+bx+c
$$
and returns its mathematical discriminant
$$b^2-4ac$$
We also define `degree`, which returns the degree of the same polynomial.

In [6]:


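def discriminant(x):
    # Unpacking works for a plain list and for a DataFrame row alike, and
    # avoids integer-key lookups on a label-indexed Series, which recent
    # pandas versions deprecate.
    a, b, c = x
    return b**2 - 4*a*c

def degree(x):
    a, b, _ = x
    if a != 0:
        d = 2
    elif b != 0:
        d = 1
    else:
        d = 0
    return d

# Only the value of the last expression in a cell is displayed.
discriminant([1,2,3])
degree([1,2,3])
degree([0,0,4])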

0

Now let's apply this function, using the previous columns as input. 

### How do we pass in the entries of the columns as input?
We select the first three columns using `iloc` and *slicing*: we pass `0:3` as the column index, which selects positions 0, 1, and 2 (the stop position is excluded). We pass `:` as the row index because we want all rows.

In [7]:
df.iloc[:,0:3]

Unnamed: 0,a,b,c
0,8,6,10
1,4,9,8
2,6,2,5
3,3,6,6
4,5,10,8
...,...,...,...
695,5,3,1
696,4,8,3
697,5,5,8
698,3,1,10
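
The difference between `iloc` and `loc` slicing is easy to get wrong, so here is a minimal sketch on a toy dataframe. The name `toy` and its values are illustrative assumptions, not part of the dataset above. It shows that `iloc` slices by position and excludes the stop, while `loc` slices by label and includes it.

```python
import pandas as pd

# Toy frame with the same column names as above (values are illustrative).
toy = pd.DataFrame(
    {"a": [8, 4], "b": [6, 9], "c": [10, 8], "Real root": [0, 0]}
)

# iloc is position-based and excludes the stop position: columns 0, 1, 2.
by_position = toy.iloc[:, 0:3]

# loc is label-based and includes the stop label: columns 'a' through 'c'.
by_label = toy.loc[:, "a":"c"]

print(list(by_position.columns))  # ['a', 'b', 'c']
print(list(by_label.columns))     # ['a', 'b', 'c']
```

Both selections return the same three coefficient columns, so either form could feed `.apply`.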


Now we apply `discriminant` to these first three columns and save the result in a new column called `Discriminant`. It is aptly named, as its value in any row is simply the mathematical discriminant of the polynomial determined by the coefficients in that row. In the same way, we apply `degree` and save the result in a column called `Degree`.

In [8]:
df['Discriminant'] = df.iloc[:,:3].apply(discriminant, axis=1)
df['Degree'] = df.iloc[:,:3].apply(degree, axis=1)
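
For a formula this simple, the same column can also be computed with plain column arithmetic instead of a row-wise `.apply`. The sketch below is only an illustration under assumptions: `coeffs` and `discriminant_row` are hypothetical names, and the three rows are copied from the sample shown earlier. It checks that both approaches agree.

```python
import pandas as pd

# Three rows copied from the sample above; `coeffs` is a hypothetical name.
coeffs = pd.DataFrame({"a": [8, 4, 5], "b": [6, 9, 9], "c": [10, 8, 1]})

def discriminant_row(row):
    # Row-wise version: `row` is a Series indexed by column name.
    return row["b"] ** 2 - 4 * row["a"] * row["c"]

# Row-by-row, as in the notebook.
via_apply = coeffs.apply(discriminant_row, axis=1)

# Column arithmetic: pandas evaluates the formula on whole columns at once.
via_columns = coeffs["b"] ** 2 - 4 * coeffs["a"] * coeffs["c"]

print(via_apply.tolist())    # [-284, -47, 61]
print(via_columns.tolist())  # [-284, -47, 61]
```

Column arithmetic is usually faster on large dataframes, while `.apply` remains the more general tool when the per-row logic needs branching, as `degree` does.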

In [9]:
# Take a peek
df.head(10)

Unnamed: 0,a,b,c,Real root,Discriminant,Degree
0,8,6,10,0,-284,2
1,4,9,8,0,-47,2
2,6,2,5,0,-116,2
3,3,6,6,0,-36,2
4,5,10,8,0,-60,2
5,10,2,6,0,-236,2
6,8,8,9,0,-224,2
7,5,9,1,1,61,2
8,9,10,5,0,-80,2
9,5,3,5,0,-91,2


In [10]:
df.to_csv('newdata.csv')


In [11]:
!ls

newdata.csv  sample_data


In [12]:
!aws s3 cp newdata.csv s3://mybucket/newdata.csv

/bin/bash: aws: command not found


So what does the `Real root` column (originally named `3`) represent? It indicates, with a 0 or 1, whether the polynomial in that row has a real root. Recall that a root is a value $x$ for which
$$ ax^2+bx+c =0$$

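Notice that the only 1 in the `Real root` column in the sample above occurs in the row where `Discriminant` is positive. This is a property of the discriminant: when $a \neq 0$, the polynomial has a real root if and only if the discriminant is non-negative. A positive discriminant gives two distinct real roots, and a zero discriminant gives a single repeated real root.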
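
As a quick sanity check, the sketch below rebuilds the ten sample rows shown earlier and compares the `Real root` label with the sign of the discriminant. The names `sample`, `predicted`, and `agrees` are assumptions made for this illustration. It relies on the fact that, for a quadratic with $a \neq 0$, a real root exists exactly when the discriminant is non-negative.

```python
import pandas as pd

# The ten rows printed by df.head(10) above, typed in by hand.
sample = pd.DataFrame({
    "a": [8, 4, 6, 3, 5, 10, 8, 5, 9, 5],
    "b": [6, 9, 2, 6, 10, 2, 8, 9, 10, 3],
    "c": [10, 8, 5, 6, 8, 6, 9, 1, 5, 5],
    "Real root": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

sample["Discriminant"] = sample["b"] ** 2 - 4 * sample["a"] * sample["c"]

# Every `a` here is non-zero, so a real root exists iff Discriminant >= 0.
predicted = (sample["Discriminant"] >= 0).astype(int)
agrees = bool((predicted == sample["Real root"]).all())

print(agrees)  # True
```

The same comparison could be run on the full dataframe, although this sketch makes no claim about rows beyond the ten shown.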

## Use `.apply` with `axis=1` to send every single row to a function

You can also send an **entire row at a time** instead of just a single column. Use this if you need to use **multiple columns to get a result**.

In [None]:
# Create a dataframe from a list of dictionaries
rectangles = [
    { 'height': 40, 'width': 10 },
    { 'height': 20, 'width': 9 },
    { 'height': 3.4, 'width': 4 }
]

rectangles_df = pd.DataFrame(rectangles)
rectangles_df

Unnamed: 0,height,width
0,40.0,10
1,20.0,9
2,3.4,4


In [None]:
# Use the height and width to calculate the area
def calculate_area(row):
    return row['height'] * row['width']

rectangles_df.apply(calculate_area, axis=1)

0    400.0
1    180.0
2     13.6
dtype: float64
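
To see what the function actually receives, the following sketch passes a small helper through `.apply` with and without `axis=1`. The names `shapes` and `describe` are hypothetical and only serve the illustration. With `axis=1` each call gets one row as a Series indexed by column name, whereas the default `axis=0` hands over one column at a time.

```python
import pandas as pd

shapes = pd.DataFrame({"height": [40, 20], "width": [10, 9]})

def describe(values):
    # Report the index labels of whatever Series pandas passes in.
    return ", ".join(str(label) for label in values.index)

per_row = shapes.apply(describe, axis=1)   # one call per row
per_column = shapes.apply(describe)        # default axis=0: one call per column

print(per_row.tolist())     # ['height, width', 'height, width']
print(per_column.tolist())  # ['0, 1', '0, 1']
```

This is why `calculate_area` can look up `row['height']` and `row['width']` by name.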

In [None]:
# Use .apply to save the new column if we'd like
rectangles_df['area'] = rectangles_df.apply(calculate_area, axis=1)
rectangles_df

Unnamed: 0,height,width,area
0,40.0,10,400.0
1,20.0,9,180.0
2,3.4,4,13.6


To save the new dataframe as a csv, we use the command below.

In [None]:
rectangles_df.to_csv('area.csv')

Then, we can see that our new file appears.

In [None]:
!ls

area.csv  sample_data
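
One detail worth knowing about `to_csv`: by default it also writes the row index as an unnamed first column. The sketch below writes to in-memory buffers rather than to disk, purely for illustration, and `rect` is a hypothetical stand-in for `rectangles_df`. It compares the default output with `index=False`.

```python
import io
import pandas as pd

rect = pd.DataFrame({"height": [40.0, 20.0], "width": [10, 9]})

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
rect.to_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
rect.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)     # ,height,width
print(header_without)  # height,width

# Reading the default output back gives the index column a placeholder name.
roundtrip = pd.read_csv(io.StringIO(with_index.getvalue()))
print(roundtrip.columns.tolist())  # ['Unnamed: 0', 'height', 'width']
```

Whether to keep the index depends on whether it carries meaning; for a plain 0, 1, 2 index, `index=False` usually gives a cleaner file.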


From here, if we were running this notebook in SageMaker, it would be easy to copy this file to our S3 bucket using shell commands. The instructions for that are below.

Alternatively, you can use git: clone your repository to this notebook instance, then commit and push the file `area.csv`.

#### Copying a local file to S3 (only works if the AWS CLI is installed)

On SageMaker the AWS CLI comes preinstalled, and there is no need to authenticate because we already gave our SageMaker instance an IAM role that allows it to access all S3 buckets.

The following `cp` command copies a single file to a specified bucket (here named `mybucket`) and key:

In [None]:
!aws s3 cp area.csv s3://mybucket/area.csv

/bin/bash: aws: command not found
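
The `command not found` message above simply means the `aws` executable is not on the PATH of this environment. As a hedged sketch, the check below uses only the Python standard library to test for the CLI before attempting the copy. The helper name `aws_cli_available` is an assumption made for this example.

```python
import shutil

def aws_cli_available():
    """Return True if an `aws` executable is found on the PATH."""
    return shutil.which("aws") is not None

if aws_cli_available():
    print("AWS CLI found; the `aws s3 cp` command can run here.")
else:
    print("AWS CLI not found; install it first, or use git instead.")
```

The check only confirms that the executable exists. It says nothing about whether credentials or an IAM role are configured.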


For more AWS CLI commands for working with S3, see the examples in the reference [here](https://docs.aws.amazon.com/cli/latest/reference/s3/index.html).