### Functions on Pandas DataFrames


* [Aggregate functions: Groupby](#fun-first)
* [Custom function: apply](#fun-second)
* [Uber NY taxi dataset](#fun-third)


### Groupby operation <a class="anchor" id="fun-first"></a>

In [47]:
import pandas as pd
df = pd.read_csv('misc/studentmarks2.csv', sep=",", header=None)
df.columns = ['Name','Marks1','Marks2']

df['Grade'] = ['Fourth','Fourth','Third',"Third","Third","Second","Second","Second","Third","Second","Second" ]


In [2]:
df.shape

(11, 4)

In [3]:
# What is the mean score of students in a particular grade? To find out, we group by the grade and take the mean.
import numpy as np
df_grade = df[ ['Marks1','Marks2'] ].groupby(df['Grade'])



In [4]:
type(df_grade)

pandas.core.groupby.generic.DataFrameGroupBy

In [5]:
# We can apply aggregate operations like count, sum, max, min, mean, etc. on a groupby object
df_grade.mean()


Grade,Marks1,Marks2
Fourth,22.5,35.0
Second,29.4,26.8
Third,32.5,21.25


In [6]:
df_grade.max()

Grade,Marks1,Marks2
Fourth,25,45
Second,45,34
Third,40,30


In [7]:
g = df_grade.max().reset_index()

In [8]:
g

Unnamed: 0,Grade,Marks1,Marks2
0,Fourth,25,45
1,Second,45,34
2,Third,40,30
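
Several aggregates can also be computed in one call. The sketch below is a minimal illustration on a small made-up frame (the name `marks` and its values are assumptions, not the contents of `studentmarks2.csv`). It uses `agg` with a list of function names, and `as_index=False` as an alternative to calling `reset_index()` afterwards.

```python
import pandas as pd

# Small made-up frame standing in for the student marks data.
marks = pd.DataFrame({
    'Grade':  ['Fourth', 'Fourth', 'Third', 'Third', 'Second'],
    'Marks1': [25, 20, 30, 40, 15],
    'Marks2': [25, 45, 30, 25, 34],
})

# agg with a list of names computes each aggregate for each column.
# The result has two-level column labels such as ('Marks1', 'mean').
summary = marks.groupby('Grade')[['Marks1', 'Marks2']].agg(['mean', 'max'])
print(summary)

# as_index=False keeps 'Grade' as an ordinary column instead of the index.
flat = marks.groupby('Grade', as_index=False)[['Marks1', 'Marks2']].max()
print(flat)
```

Grouping by the column name, rather than by the Series `df['Grade']`, is what lets `as_index=False` carry the grouping column into the result.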


### Custom functions: Apply method <a class="anchor" id="fun-second"></a>

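The <code>apply</code> method lets us manipulate data with a custom function. As an example, here is how one would find the standard deviation of the marks using a function from <code>numpy</code>. Groupby objects also have a built-in <code>std</code> method, but it computes the sample standard deviation (<code>ddof=1</code>) by default, whereas <code>np.std</code> computes the population standard deviation (<code>ddof=0</code>).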


In [9]:
import numpy as np

df_grade.apply(np.std) 


Grade,Marks1,Marks2
Fourth,2.5,10.0
Second,10.928861,5.74108
Third,8.291562,6.495191
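
The values above are population standard deviations, which is what `np.std` computes by default. As a hedged comparison, the sketch below uses a tiny made-up frame (the name `marks` and its values are assumptions) to put the built-in groupby `std` next to a custom function passed to `apply`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: two students in each grade.
marks = pd.DataFrame({
    'Grade':  ['Fourth', 'Fourth', 'Third', 'Third'],
    'Marks1': [25, 20, 30, 40],
})
grouped = marks.groupby('Grade')['Marks1']

# Built-in aggregate: sample standard deviation (ddof=1 by default).
sample_std = grouped.std()

# Same built-in with ddof=0: population standard deviation.
population_std = grouped.std(ddof=0)

# Custom function through apply, calling numpy on the raw values.
custom_std = grouped.apply(lambda s: np.std(s.to_numpy()))

print(sample_std, population_std, custom_std, sep='\n\n')
```

For the 'Fourth' group, the population value is 2.5, while the sample value is larger because it divides by n - 1 instead of n.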


#### Apply function on rows

In [10]:
# We can use the apply function on rows by passing the option axis=1 or axis='columns'.
# Here we give an alternative way to get the total column using apply.

def sumup(row):
    return row['Marks1'] + row['Marks2']

df['Total'] = df.apply(sumup,axis='columns')

In [11]:
df

Unnamed: 0,Name,Marks1,Marks2,Grade,Total
0,priya,25,25,Fourth,50
1,sandesh,20,45,Fourth,65
2,adil,30,30,Third,60
3,ranjan,40,25,Third,65
4,shubha,20,15,Third,35
5,james,15,34,Second,49
6,himanshu,20,20,Second,40
7,aryan,37,20,Second,57
8,soumya,40,15,Third,55
9,vikram,45,30,Second,75
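
Row-wise `apply` calls the function once per row, so for simple arithmetic a column-level expression is usually shorter and faster. The sketch below is an illustration on a three-row made-up frame (the name `marks` is an assumption) showing that the approaches give the same totals.

```python
import pandas as pd

# Three made-up rows standing in for the student marks data.
marks = pd.DataFrame({
    'Name':   ['priya', 'sandesh', 'adil'],
    'Marks1': [25, 20, 30],
    'Marks2': [25, 45, 30],
})

# Row-wise apply: the lambda is called once per row.
total_apply = marks.apply(lambda row: row['Marks1'] + row['Marks2'], axis='columns')

# Vectorized alternatives: the addition runs on whole columns at once.
total_add = marks['Marks1'] + marks['Marks2']
total_sum = marks[['Marks1', 'Marks2']].sum(axis='columns')

print(total_apply.tolist(), total_add.tolist(), total_sum.tolist())
```

Row-wise `apply` remains useful when the per-row logic cannot be written as column operations, for example when it branches on several fields.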


In [52]:
df['Name'].apply(str.capitalize)

0        Priya
1      Sandesh
2         Adil
3       Ranjan
4       Shubha
5        James
6     Himanshu
7        Aryan
8       Soumya
9       Vikram
10        Asha
Name: Name, dtype: object

In [51]:
df['Name'].apply(lambda x:x.capitalize())

0        Priya
1      Sandesh
2         Adil
3       Ranjan
4       Shubha
5        James
6     Himanshu
7        Aryan
8       Soumya
9       Vikram
10        Asha
Name: Name, dtype: object
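
For string columns, pandas also provides the `.str` accessor, which applies the string method to every element. A minimal sketch, assuming a made-up Series `names` that contains one missing entry:

```python
import pandas as pd

# Made-up names with one missing entry.
names = pd.Series(['priya', 'sandesh', None, 'adil'])

# The .str accessor works element-wise and leaves missing values missing.
capitalized = names.str.capitalize()
print(capitalized)
```

With `apply(str.capitalize)`, the missing entry would be passed to `str.capitalize` directly and the call would fail, so the accessor is the safer choice when a column may contain gaps.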

### Problem: Uber taxi rides <a class="anchor" id="fun-third"></a>

We have a dataset of 10000 Uber rides in New York City, recorded over the first eight days of April 2014. The columns give the time and location of each pickup.

In [12]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

uber_df = pd.read_csv('misc/uber-apr14.csv')
uber_df.columns

Index(['Id', 'Date/Time', 'Lat', 'Lon', 'Base'], dtype='object')

In [13]:
uber_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
Id           10000 non-null int64
Date/Time    10000 non-null object
Lat          10000 non-null float64
Lon          10000 non-null float64
Base         10000 non-null object
dtypes: float64(2), int64(1), object(2)
memory usage: 390.7+ KB


In [14]:
uber_df.head()

Unnamed: 0,Id,Date/Time,Lat,Lon,Base
0,0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


**Q1. Datetime format**. The output of <code>info()</code> shows that the column 'Date/Time' has dtype object, that is, it holds plain strings. Convert the column to pandas datetime format. Use the function <code>to_datetime</code> as shown below.

In [17]:
pd.to_datetime('1/6/2014 0:11:00').weekday()


0
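
`to_datetime` also accepts a whole column. The sketch below is one possible way to do the conversion, on a two-row made-up frame called `rides`. The explicit `format` string is an assumption based on the values shown by `head()` (month first, 24-hour clock).

```python
import pandas as pd

# Two made-up rows in the same text format as the 'Date/Time' column.
rides = pd.DataFrame({'Date/Time': ['4/1/2014 0:11:00', '4/7/2014 13:05:00']})

# Convert the whole column in one call; the format says the month comes first.
rides['Date/Time'] = pd.to_datetime(rides['Date/Time'], format='%m/%d/%Y %H:%M:%S')

print(rides['Date/Time'].dtype)
print(rides)
```

Giving the format explicitly removes any ambiguity between month-first and day-first dates such as 4/1/2014.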

**Q2.** Add a new column called 'Day' denoting the day of the week. Values of this column should be from 1 (Monday) to 7 (Sunday).

In [69]:
uber_df['Date/Time']=uber_df['Date/Time'].apply(pd.to_datetime)

In [70]:
uber_df['Day']=uber_df['Date/Time'].apply(lambda x:x.weekday()+1)

In [108]:
uber_df['date']=uber_df['Date/Time'].apply(lambda x:x.date())
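
Once a column has a datetime dtype, the same fields are available through the `.dt` accessor without `apply`. A minimal sketch on a two-row made-up frame (the name `rides` is an assumption):

```python
import pandas as pd

# Two made-up pickups: a Tuesday and a Sunday.
rides = pd.DataFrame({
    'Date/Time': pd.to_datetime(['2014-04-01 00:11:00', '2014-04-06 18:30:00'])
})

# dt.weekday is 0 for Monday through 6 for Sunday, so adding 1 gives 1 to 7.
rides['Day'] = rides['Date/Time'].dt.weekday + 1

# dt.date gives plain datetime.date objects, as x.date() does inside apply.
rides['date'] = rides['Date/Time'].dt.date

print(rides)
```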


In [109]:
uber_df.head()

Unnamed: 0,Id,Date/Time,Lat,Lon,Base,Day,date
0,0,2014-04-01 00:11:00,40.769,83.0,B02512,2,2014-04-01
1,1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,2,2014-04-01
2,2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,2,2014-04-01
3,3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,2,2014-04-01
4,4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,2,2014-04-01


In [111]:
uber_df.groupby(['Day','date']).count()

Day,date,Id,Date/Time,Lat,Lon,Base
1,2014-04-07,1376,1376,1376,1376,1376
2,2014-04-01,1011,1011,1011,1011,1011
2,2014-04-08,839,839,839,839,839
3,2014-04-02,1336,1336,1336,1336,1336
4,2014-04-03,1482,1482,1482,1482,1482
5,2014-04-04,1827,1827,1827,1827,1827
6,2014-04-05,1309,1309,1309,1309,1309
7,2014-04-06,820,820,820,820,820


In [116]:
uber_df.groupby(['Day','date']).count().groupby('Day').mean()

Day,Id,Date/Time,Lat,Lon,Base
1,1376,1376,1376,1376,1376
2,925,925,925,925,925
3,1336,1336,1336,1336,1336
4,1482,1482,1482,1482,1482
5,1827,1827,1827,1827,1827
6,1309,1309,1309,1309,1309
7,820,820,820,820,820
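
Because `count()` repeats the same number in every column, the average number of rides per weekday can be written more compactly with `size()`, which returns one count per group. The sketch below illustrates this on made-up data (the frame `rides` and its six rows are assumptions).

```python
import pandas as pd

# Made-up rides: weekday 2 occurs on two dates, weekday 3 on one.
rides = pd.DataFrame({
    'Day':  [2, 2, 2, 2, 2, 3],
    'date': ['2014-04-01'] * 3 + ['2014-04-08'] * 2 + ['2014-04-02'],
})

# size() counts rows per (Day, date) group and returns a single Series.
rides_per_date = rides.groupby(['Day', 'date']).size()

# Average the per-date counts within each weekday.
mean_per_day = rides_per_date.groupby(level='Day').mean()

print(rides_per_date)
print(mean_per_day)
```

Weekday 2 appears on two dates with 3 and 2 rides, so its average is 2.5, in the same way that the two Tuesdays above average to 925.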


In [104]:
uber_df.head()

Unnamed: 0,Id,Date/Time,Lat,Lon,Base,Day
0,0,2014-04-01 00:11:00,40.769,83.0,B02512,2
1,1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,2
2,2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,2
3,3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,2
4,4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,2


In [97]:
uber_df.shape

(10000, 6)

**Q4.** Plotting the points shows that most of the rides happen within the city. We can restrict the plot to fixed intervals using <code>xlim</code> and <code>ylim</code>. Find two longitude values such that 80% of the points lie between them, using the <code>quantile</code> function. Do the same for the latitude values and plot the figure again.

In [None]:

plt.figure(figsize=(12, 12))
# Plot the pickup locations: longitude on the x-axis, latitude on the y-axis
plt.plot(uber_df['Lon'], uber_df['Lat'], '.', alpha=0.5)
#plt.xlim(-74.2,-73.8)


In [24]:
# 0.8 quantiles of the longitude and latitude values
x = uber_df['Lon'].quantile(0.8)
y = uber_df['Lat'].quantile(0.8)

In [25]:
g=uber_df['Lon'].min()

In [30]:
sum((uber_df['Lon']>g) & (uber_df['Lon']<x))/len(uber_df['Lon'])

0.7995

In [35]:
len(uber_df['Lon'])

10000

In [34]:
sum((uber_df['Lon']>g) & (uber_df['Lon']<x))

7995
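
The interval from the minimum to the 0.8 quantile holds about 80% of the points, but it keeps every outlier on the low side. A symmetric alternative is to take the 0.1 and 0.9 quantiles, which bracket the central 80%. The sketch below uses synthetic longitudes (the Series `lon` is an assumption standing in for `uber_df['Lon']`).

```python
import numpy as np
import pandas as pd

# Synthetic longitudes standing in for uber_df['Lon'].
rng = np.random.default_rng(0)
lon = pd.Series(rng.normal(loc=-73.97, scale=0.05, size=10_000))

# The 0.1 and 0.9 quantiles bracket the central 80% of the values.
lo, hi = lon.quantile([0.1, 0.9])

# between() includes both end points by default;
# the mean of a boolean Series is the fraction of True values.
inside = lon.between(lo, hi).mean()

print(lo, hi, inside)
```

The resulting pair can be passed to `plt.xlim(lo, hi)`, and the same steps on the latitude column give the limits for `plt.ylim`.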

In [36]:
sr=pd.Series([1,1,1,3,5,8,8,10,11])
sr.quantile(0.5)

5.0

In [54]:
sr.quantile(0.8)

8.8

In [53]:
sr[int(len(sr)*0.8)]

10
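
The two results differ because `quantile` interpolates linearly between neighbouring sorted values by default, while plain indexing picks a single element. The sketch below reproduces the 8.8 by hand; the variable names are illustrative.

```python
import pandas as pd

sr = pd.Series([1, 1, 1, 3, 5, 8, 8, 10, 11])
q = 0.8

# Fractional position of the quantile among the sorted values.
pos = (len(sr) - 1) * q      # about 6.4
lower = int(pos)             # index 6
frac = pos - lower           # about 0.4

# Interpolate between the values at index 6 (8) and index 7 (10).
values = sr.sort_values().to_numpy()
manual = values[lower] + frac * (values[lower + 1] - values[lower])

print(manual, sr.quantile(q))
```

Here 8 + 0.4 * (10 - 8) gives 8.8, matching `sr.quantile(0.8)`.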

In [55]:
sr

0     1
1     1
2     1
3     3
4     5
5     8
6     8
7    10
8    11
dtype: int64

In [56]:
sr[0]=30

In [57]:
sr


0    30
1     1
2     1
3     3
4     5
5     8
6     8
7    10
8    11
dtype: int64

In [58]:
uber_df

Unnamed: 0,Id,Date/Time,Lat,Lon,Base
0,0,4/1/2014 0:11:00,40.7690,-73.9549,B02512
1,1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4,4/1/2014 0:33:00,40.7594,-73.9722,B02512
5,5,4/1/2014 0:33:00,40.7383,-74.0403,B02512
6,6,4/1/2014 0:39:00,40.7223,-73.9887,B02512
7,7,4/1/2014 0:45:00,40.7620,-73.9790,B02512
8,8,4/1/2014 0:55:00,40.7524,-73.9960,B02512
9,9,4/1/2014 1:01:00,40.7575,-73.9846,B02512


In [61]:
uber_df['Lon'][0]

-73.9549

In [64]:
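# Chained indexing: selects the column first, then assigns into it (triggers SettingWithCopyWarning)
uber_df['Lon'][0]=189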

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [65]:
uber_df.head()

Unnamed: 0,Id,Date/Time,Lat,Lon,Base
0,0,4/1/2014 0:11:00,40.769,189.0,B02512
1,1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [67]:
uber_df.at[0,'Lon']=83

In [68]:
uber_df

Unnamed: 0,Id,Date/Time,Lat,Lon,Base
0,0,4/1/2014 0:11:00,40.7690,83.0000,B02512
1,1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4,4/1/2014 0:33:00,40.7594,-73.9722,B02512
5,5,4/1/2014 0:33:00,40.7383,-74.0403,B02512
6,6,4/1/2014 0:39:00,40.7223,-73.9887,B02512
7,7,4/1/2014 0:45:00,40.7620,-73.9790,B02512
8,8,4/1/2014 0:55:00,40.7524,-73.9960,B02512
9,9,4/1/2014 1:01:00,40.7575,-73.9846,B02512
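
The warning comes from chained indexing: `uber_df['Lon'][0] = 189` first selects a column and then assigns into that intermediate object. Whether the change reaches the original frame depends on the pandas version. The sketch below shows the single-step forms on a two-row made-up frame (the name `rides` is an assumption).

```python
import pandas as pd

# Two made-up rows with the same coordinate columns.
rides = pd.DataFrame({'Lat': [40.7690, 40.7267], 'Lon': [-73.9549, -74.0345]})

# Single-step forms name the row and the column together,
# so the assignment always targets the frame itself.
rides.at[0, 'Lon'] = 83.0     # one scalar cell, addressed by labels
rides.loc[1, 'Lon'] = -74.0   # label-based; also accepts slices and boolean masks

print(rides)
```

`at` is meant for a single cell, while `loc` covers the general case, such as setting all rows that match a condition.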
