### Functions on pandas DataFrames


* [Aggregate functions: Groupby](#fun-first)
* [Custom function: apply](#fun-second)
* [Uber NY taxi dataset](#fun-third)


### Groupby operation <a class="anchor" id="fun-first"></a>

In [None]:
import pandas as pd
df = pd.read_csv('misc/studentmarks2.csv', sep=",", header=None)
df.columns = ['Name','Marks1','Marks2']

df['Grade'] = ['Fourth','Fourth','Third',"Third","Third","Second","Second","Second","Third","Second","Second" ]


In [None]:
df.shape

In [None]:
# What is the mean score of students in a particular grade? To find out, we group by grade and take the mean.
import numpy as np
df_grade = df[ ['Marks1','Marks2'] ].groupby(df['Grade'])



In [None]:
type(df_grade)

In [None]:
# We can apply aggregate operations such as count, sum, max, min and mean to a groupby object
df_grade.mean()


In [None]:
df_grade.max()

In [None]:
g = df_grade.max().reset_index()

In [None]:
g
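The grouping steps above can also be written by passing the column name to <code>groupby</code> and requesting several aggregates in one call. The sketch below uses a small made-up table (the names and marks are hypothetical, not taken from `studentmarks2.csv`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the student marks file
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Marks1': [10, 20, 30, 40],
    'Marks2': [15, 25, 35, 45],
    'Grade': ['Second', 'Second', 'Third', 'Third'],
})

# Group by the 'Grade' column, keep only the marks columns,
# and compute the mean and the max of each in a single call
summary = toy.groupby('Grade')[['Marks1', 'Marks2']].agg(['mean', 'max'])
print(summary)
```

The result has one row per grade and a two-level column index (marks column, aggregate name), so a single value can be read with, for example, `summary.loc['Second', ('Marks1', 'mean')]`.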

### Custom functions: Apply method <a class="anchor" id="fun-second"></a>

The <code>apply</code> method lets us manipulate data with a custom function. As an example, here is how one would find the standard deviation of the marks using a function from <code>numpy</code>. Groupby objects also provide a built-in <code>std</code> method; <code>apply</code> is most useful when no built-in aggregate does what we need.


In [None]:
import numpy as np

df_grade.apply(np.std) 
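As a minimal sketch of a case where no built-in aggregate fits, the example below computes the range (max minus min) of each marks column per grade with <code>apply</code>, and shows the built-in <code>std</code> next to it for comparison. The toy table and the helper name `marks_range` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the student marks file
toy = pd.DataFrame({
    'Marks1': [10, 20, 30, 40],
    'Marks2': [15, 25, 35, 50],
    'Grade': ['Second', 'Second', 'Third', 'Third'],
})

grouped = toy.groupby('Grade')[['Marks1', 'Marks2']]

def marks_range(group):
    # 'group' is the sub-DataFrame of one grade; return max - min per column
    return group.max() - group.min()

ranges = grouped.apply(marks_range)   # custom function applied to each group
stds = grouped.std()                  # built-in sample standard deviation
print(ranges)
print(stds)
```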


#### Apply function on rows

In [None]:
# We can apply a function to each row by passing the option axis=1 or axis='columns'.
# Here we give an alternative way to compute a 'Total' column using apply.

def sumup(row):
    return row['Marks1'] + row['Marks2']

df['Total'] = df.apply(sumup,axis='columns')

In [None]:
df
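For a simple sum like this one, row-wise <code>apply</code> is not the only option: adding the columns directly gives the same result and is usually faster on large tables. The sketch below compares the three approaches on a made-up table; the column names `Total_apply`, `Total_vec` and `Total_sum` are only for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same marks columns as above
toy = pd.DataFrame({'Marks1': [10, 20, 30], 'Marks2': [15, 25, 35]})

# Row-wise apply, as in the cell above
toy['Total_apply'] = toy.apply(lambda row: row['Marks1'] + row['Marks2'], axis='columns')

# Vectorized column arithmetic
toy['Total_vec'] = toy['Marks1'] + toy['Marks2']

# Sum across the selected columns of each row
toy['Total_sum'] = toy[['Marks1', 'Marks2']].sum(axis='columns')

print(toy)
```

Row-wise <code>apply</code> remains the more flexible choice when the per-row logic cannot be expressed as column arithmetic.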

### Problem: Uber taxi drives <a class="anchor" id="fun-third"></a>

We have a dataset of about 10,000 Uber rides from one day in New York City. The columns give the time and location of each pickup.

In [86]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

uber_df = pd.read_csv('misc/uber-apr14.csv')
uber_df.columns

Index(['Id', 'Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')

In [None]:
uber_df.info()

In [None]:
uber_df.head()

**Q1. Datetime format**. The output of <code>uber_df.info()</code> shows that the column 'Date/Time' has dtype object, i.e. it is stored as strings. Convert the column to the pandas datetime format. Use the function <code>to_datetime</code>, as in the example below.

In [None]:
pd.to_datetime('1/6/2014 0:11:00').weekday()
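<code>to_datetime</code> also accepts a whole column. The sketch below converts a small made-up column and reads the weekday through the <code>.dt</code> accessor. The two sample strings and the month/day/year <code>format</code> string are assumptions for illustration; check them against the actual values in `uber-apr14.csv`.

```python
import pandas as pd

# Hypothetical sample rows; the real file may use a different date layout
sample = pd.DataFrame({'Date/Time': ['4/1/2014 0:11:00', '4/2/2014 13:45:00']})

# Convert the string column to datetime64; the format is an assumption
sample['Date/Time'] = pd.to_datetime(sample['Date/Time'], format='%m/%d/%Y %H:%M:%S')

print(sample.dtypes)

# .dt.weekday gives Monday=0 ... Sunday=6 for every row
weekdays = sample['Date/Time'].dt.weekday
print(weekdays)
```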


**Q2.** Add a new column called 'Day' denoting the day of the week. Values of this column should be from 1 to 7. Note that <code>weekday()</code> returns values from 0 (Monday) to 6 (Sunday).

**Q3.** Find the average number of rides for each day of the week.

**Q4.** By plotting the points we can see that most of the rides happen within the city. We can restrict the plot to fixed intervals using <code>xlim</code> and <code>ylim</code>. Find two longitude values such that 80% of the points lie between them. Use the <code>quantile</code> function. Do the same for the latitude values and plot the figure again.

In [None]:

plt.figure(figsize=(12, 12))
plt.plot(uber_df['Lon'], uber_df['Lat'], '.', alpha=0.5)
#plt.xlim(-74.2,-73.8)
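As a hint for how <code>quantile</code> yields such an interval, the sketch below uses synthetic longitudes (randomly generated, not the Uber data) and takes the 10th and 90th percentiles, which is one of several ways to enclose 80% of the points.

```python
import numpy as np
import pandas as pd

# Synthetic longitudes for illustration only; not the Uber data
rng = np.random.default_rng(0)
lon = pd.Series(rng.normal(loc=-73.97, scale=0.05, size=1000))

# The 10% and 90% quantiles bound the central 80% of the values
lo, hi = lon.quantile([0.10, 0.90])

# Fraction of points that fall inside the interval
inside = lon.between(lo, hi).mean()
print(lo, hi, inside)
```

The two bounds can then be passed to <code>plt.xlim(lo, hi)</code>; the same steps apply to the latitude column and <code>plt.ylim</code>.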
