# Pandas 

We will continue with pandas and cover the following topics:

- [String Methods](#string-methods)
- [Datetime in Pandas](#datetime-in-pandas)


## String Methods

We can use the usual string methods to manipulate data inside a dataframe. To invoke a string method, use the `.str` accessor of a series. We will work with the Name column, which is stored as strings.


In [1]:
import pandas as pd

df_titanic = pd.read_csv("Data/Titanic.csv")
df_titanic.head(20)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [2]:
# Calling a string method directly on a series raises an error
df_titanic.Name.split()

AttributeError: 'Series' object has no attribute 'split'

In [7]:
#Using the lower() method

df_titanic["Name"].str.lower().head()

0                                kelly, mr. james
1                wilkes, mrs. james (ellen needs)
2                       myles, mr. thomas francis
3                                wirz, mr. albert
4    hirvonen, mrs. alexander (helga e lindqvist)
Name: Name, dtype: object

- Here `.str` exposes string methods that are applied to every element of the series.
- We apply `.lower()` to make each string lowercase.
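
To make the element-wise behaviour concrete, here is a minimal sketch on a small made-up series (the variable names are for illustration only and are not part of the Titanic example). Each `.str` call returns a new series with one result per element.

```python
import pandas as pd

# Made-up values for illustration, not read from the Titanic file
names = pd.Series(["Kelly, Mr. James", "Connolly, Miss. Kate"])

# .str applies the string method to every element and returns a new Series
lowered = names.str.lower()
lengths = names.str.len()

print(lowered.tolist())  # ['kelly, mr. james', 'connolly, miss. kate']
print(lengths.tolist())  # [16, 20]
```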

The code above returns the Name column as a new series with all letters in lowercase. Another useful string method is `contains()`, which returns a boolean for each element indicating whether it contains the given pattern. For a plain Python string, the analogous check is the `in` operator.

In [10]:
# Plain-Python analogue: the `in` operator calls __contains__ under the hood
'll' in 'hello'

True

In [16]:
#using contains() method to check for title Mr. (raw string keeps the backslash intact)
df_titanic["Name"].str.contains(r"Mr\.").head()

0     True
1    False
2     True
3     True
4    False
Name: Name, dtype: bool
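
To see what the backslash in the pattern is doing, here is a small sketch on made-up names (not rows from the Titanic file). By default `str.contains` treats its pattern as a regular expression, so the three calls below can give different answers.

```python
import pandas as pd

# Made-up names chosen so that the three patterns disagree
titles = pd.Series(["Kelly, Mr. James", "Wilkes, Mrs. James", "Mraz, Anna"])

# Unescaped: "." matches any character, so "Mrs" and "Mra" also match
unescaped = titles.str.contains("Mr.")

# Escaped: "\." matches only a literal dot
escaped = titles.str.contains(r"Mr\.")

# regex=False: the pattern is searched for as a plain substring
literal = titles.str.contains("Mr.", regex=False)

print(unescaped.tolist())  # [True, True, True]
print(escaped.tolist())    # [True, False, False]
print(literal.tolist())    # [True, False, False]
```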

Note that we need the escape character `\` to match a literal `.`, because an unescaped `.` matches any character in a regular expression (regex). Read more about regex [here](https://docs.python.org/3/library/re.html).
Alternatively, we can pass `regex=False` so that the pattern is treated as a plain substring.
We can easily store the returned series in a new column as follows.

In [17]:
#Creating new column
df_titanic["Bool_Mr"] = df_titanic["Name"].str.contains(r"Mr\.")
# the same as 
# df_titanic["Bool_Mr"] = df_titanic["Name"].str.contains("Mr.", regex=False)
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Bool_Mr
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,True
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,False
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,True
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,True
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,False


Now we have a new boolean column, and we can easily compute, for example, the fraction of "Mr."s as follows:

In [18]:
#Computing the proportion of Mr's
df_titanic.Bool_Mr.mean()

0.5741626794258373
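
The mean of a boolean series can be read as a fraction because `True` counts as 1 and `False` as 0. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up boolean flags for illustration
flags = pd.Series([True, False, True, True])

count_true = flags.sum()      # number of True values
fraction_true = flags.mean()  # count_true divided by the number of rows

print(count_true, fraction_true)  # 3 0.75
```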

We can also use these string methods on the column names. Let's say I want to remove all the underscores (there is only one). I can do that with the replace method by replacing each underscore with an empty string. Recall that I access the column names through the columns attribute of any dataframe.

In [19]:
df_titanic.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Bool_Mr'],
      dtype='object')

In [20]:
#Replace the underscore
df_titanic.columns = df_titanic.columns.str.replace("_", "")

df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,BoolMr
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,True
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,False
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,True
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,True
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,False
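
The `.str` accessor is available on `df.columns` because it is an `Index` of strings, and the calls can be chained. The sketch below uses a hypothetical dataframe with made-up column names; `regex=False` is passed so that the underscore is treated as a plain character.

```python
import pandas as pd

# Hypothetical dataframe with made-up column names
df_demo = pd.DataFrame({"Bool_Mr": [True], "Passenger_Id": [892]})

# Remove underscores, then lowercase every column name
df_demo.columns = df_demo.columns.str.replace("_", "", regex=False).str.lower()

print(list(df_demo.columns))  # ['boolmr', 'passengerid']
```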


Notice the name of the column we created above has been changed.  Before we move to the next section, I will delete this column.

In [21]:
del df_titanic["BoolMr"]

df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


For a comprehensive list of string methods, see the [pandas tutorial](https://pandas.pydata.org/docs/user_guide/text.html) on working with text data.

In [22]:
df_titanic.Name.unique()

array(['Kelly, Mr. James', 'Wilkes, Mrs. James (Ellen Needs)',
       'Myles, Mr. Thomas Francis', 'Wirz, Mr. Albert',
       'Hirvonen, Mrs. Alexander (Helga E Lindqvist)',
       'Svensson, Mr. Johan Cervin', 'Connolly, Miss. Kate',
       'Caldwell, Mr. Albert Francis',
       'Abrahim, Mrs. Joseph (Sophie Halaut Easu)',
       'Davies, Mr. John Samuel', 'Ilieff, Mr. Ylio',
       'Jones, Mr. Charles Cresson',
       'Snyder, Mrs. John Pillsbury (Nelle Stevenson)',
       'Howard, Mr. Benjamin',
       'Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)',
       'del Carlo, Mrs. Sebastiano (Argenia Genovesi)',
       'Keane, Mr. Daniel', 'Assaf, Mr. Gerios',
       'Ilmakangas, Miss. Ida Livija',
       'Assaf Khalil, Mrs. Mariana (Miriam")"', 'Rothschild, Mr. Martin',
       'Olsen, Master. Artur Karl',
       'Flegenheim, Mrs. Alfred (Antoinette)',
       'Williams, Mr. Richard Norris II',
       'Ryerson, Mrs. Arthur Larned (Emily Maria Borie)',
       'Robins, Mr. Alexander

*(Exercise)*: 
- Find who has the longest name in the dataset.
- Check whether every row with "Mr." in the name has "male" in the Sex column.

In [33]:
df_titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [28]:
# Keep every row whose name length equals the maximum (ties are all returned)
name_len = df_titanic.Name.str.len()
df_titanic[name_len == name_len.max()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
343,1235,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C
397,1289,1,"Frolicher-Stehli, Mrs. Maxmillian (Margaretha ...",female,48.0,1,1,13567,79.2,B41,C
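
The result above contains two rows because two names tie for the maximum length. The sketch below, on made-up names, contrasts the boolean-mask approach with `idxmax()`, which returns only the first label at which the maximum occurs.

```python
import pandas as pd

# Made-up names; the first two have the same length
names = pd.Series(["Wirz, Mr. Albert", "Kelly, Mr. James", "Smith, Ms. Jo"])
lengths = names.str.len()

# Boolean mask: keeps every name that reaches the maximum length
all_longest = names[lengths == lengths.max()]

# idxmax: label of the first maximum only
first_longest_label = lengths.idxmax()

print(all_longest.tolist())
print(first_longest_label)
```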


In [42]:
# Check that the rows with "Mr." in the Name column have "male" in the Sex column
# all(df_titanic[df_titanic.Name.str.contains(r"Mr\.")].Sex == "male")
df_MR = df_titanic[df_titanic['Name'].str.contains(r'Mr\.')]
# Keep the "Mr." rows that are male; the row count should match len(df_MR)
df_MR[df_MR.Sex == 'male']

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...
406,1298,2,"Ware, Mr. William Jeffery",male,23.0,1,0,28666,10.5000,,S
407,1299,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
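
Displaying the filtered rows still leaves the yes/no question to the reader. One possible way to get a single answer is to reduce the comparison with `.all()` and to list any exceptions explicitly. This is a sketch on a hypothetical three-row dataframe, not the Titanic data.

```python
import pandas as pd

# Hypothetical data with the same two columns as the exercise
people = pd.DataFrame({
    "Name": ["Kelly, Mr. James", "Wilkes, Mrs. James", "Wirz, Mr. Albert"],
    "Sex": ["male", "female", "male"],
})

# Boolean mask for rows whose name contains the literal "Mr."
is_mr = people["Name"].str.contains(r"Mr\.")

# True only if every "Mr." row is male
all_mr_are_male = bool((people.loc[is_mr, "Sex"] == "male").all())

# Rows that would contradict the claim (empty if there are none)
exceptions = people[is_mr & (people["Sex"] != "male")]

print(all_mr_are_male, len(exceptions))  # True 0
```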


## Datetime in Pandas

Now let's dive into the datetime column type using the Parking data set, where each row corresponds to a different parking ticket issued in NYC.

In [43]:
import pandas as pd

# We will look at another dataset
df_parking = pd.read_csv("Data/Parking.csv")
df_parking.head()

Unnamed: 0.1,Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make
0,0,NY,6/30/16 2:17,TOYOT
1,1,NY,7/4/16 1:18,ME/BE
2,2,NY,7/11/16 6:15,LINCO
3,3,NY,7/4/16 1:10,NISSA
4,4,NY,7/1/16 6:30,VOLKS


I don't like having that Unnamed column. I can fix this by telling pandas that I want that column to be the index instead of a separate column.

In [44]:
#Read in parking and specify column that will serve as index
df_parking = pd.read_csv("Data/Parking.csv", index_col=0)
df_parking.head()

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make
0,NY,6/30/16 2:17,TOYOT
1,NY,7/4/16 1:18,ME/BE
2,NY,7/11/16 6:15,LINCO
3,NY,7/4/16 1:10,NISSA
4,NY,7/1/16 6:30,VOLKS


Now let's have a look at how pandas read in each column. Note that a dataframe always uses the **same** type for a whole column.

In [45]:
#look at how each column is stored
df_parking.dtypes

Registration_State    object
Issue_Date            object
Vehicle_Make          object
dtype: object

Although we can store a date/time as a string, a string does not support any time-related operations.
For example, how would we extract the month from the string?
We want Issue_Date to be a `datetime`, not a string. Let's convert it.

In [46]:
# Convert the Issue_Date column to datetime
df_parking["Issue_Date"] = pd.to_datetime(df_parking["Issue_Date"])

# Now it's a datetime column
df_parking.dtypes

Registration_State            object
Issue_Date            datetime64[ns]
Vehicle_Make                  object
dtype: object

Now one can sort the data by date.

In [47]:
df_parking.sort_values(by="Issue_Date", inplace=False)

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make
359,NY,2015-05-26 14:54:00,FIAT
476,NY,2016-03-10 17:30:00,NISSA
477,NY,2016-03-11 19:00:00,FORD
478,CT,2016-03-12 12:30:00,BMW
479,NY,2016-03-18 17:56:00,BMW
...,...,...,...
208,NY,2016-07-19 22:55:00,VOLKS
172,NY,2016-07-20 23:23:00,AUDI
100,NY,2016-07-31 08:07:00,MACK
314,NY,2017-04-28 03:20:00,


In [48]:
pd.to_datetime('16-10-30', format = '%y-%m-%d')

Timestamp('2016-10-30 00:00:00')
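
The `format` argument used in the cell above can also describe strings shaped like the raw `Issue_Date` values. The sketch below assumes the layout is month/day/two-digit year followed by hour:minute, which matches the rows shown earlier but is not confirmed for the whole file. The second call shows `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising.

```python
import pandas as pd

# Strings in the same shape as the raw Issue_Date column
raw = pd.Series(["6/30/16 2:17", "7/4/16 1:18"])

# Assumed layout: month/day/two-digit year, then hour:minute
parsed = pd.to_datetime(raw, format="%m/%d/%y %H:%M")

# A made-up bad entry becomes NaT rather than raising an error
messy = pd.Series(["6/30/16 2:17", "unknown"])
coerced = pd.to_datetime(messy, format="%m/%d/%y %H:%M", errors="coerce")

print(parsed.tolist())
print(coerced.isna().tolist())  # [False, True]
```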

Let's look at the first entry of this column, which is now a datetime column.

In [49]:
first_entry = df_parking.loc[0,"Issue_Date"]
first_entry

Timestamp('2016-06-30 02:17:00')

We see that it is a timestamp.  Timestamps have lots of nice attributes that we can extract.

In [50]:
#Get the day
first_entry.day

30

In [51]:
#Get the month
first_entry.month

6

In [52]:
# We can get the weekday name
first_entry.day_name()

'Thursday'

In [53]:
#We can even see if the year is a leap year
first_entry.is_leap_year

True

In [54]:
df_parking["Issue_Date"].dt.day 

0      30
1       4
2      11
3       4
4       1
       ..
494    12
495    17
496     7
497     9
498    23
Name: Issue_Date, Length: 499, dtype: int64

If we want to get these attributes for an entire column, as in the cell above, we use the `.dt` accessor of a `pandas.Series`, which exposes the datetime properties for every element.
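
As a compact illustration of `.dt`, here is a sketch on two made-up timestamps (the same values as the first two parking rows, typed in by hand). Each property or method returns a series with one value per row.

```python
import pandas as pd

# Two hand-typed timestamps for illustration
stamps = pd.Series(pd.to_datetime(["2016-06-30 02:17", "2016-07-04 01:18"]))

days = stamps.dt.day                    # day of the month
months = stamps.dt.month                # month number
weekday_numbers = stamps.dt.dayofweek   # Monday is 0, Sunday is 6
weekday_names = stamps.dt.day_name()    # English names by default

print(days.tolist(), months.tolist())
print(weekday_numbers.tolist(), weekday_names.tolist())
```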

In [55]:
#Get the day of the week for the entire column
# An alternative is .dt.dayofweek, which returns integers (Monday is 0)
all_dow = df_parking["Issue_Date"].dt.day_name()
all_dow.head()

0    Thursday
1      Monday
2      Monday
3      Monday
4      Friday
Name: Issue_Date, dtype: object

In [56]:
#Let's add this as a new column
df_parking["DOW"] = df_parking["Issue_Date"].dt.day_name()
df_parking.head()

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make,DOW
0,NY,2016-06-30 02:17:00,TOYOT,Thursday
1,NY,2016-07-04 01:18:00,ME/BE,Monday
2,NY,2016-07-11 06:15:00,LINCO,Monday
3,NY,2016-07-04 01:10:00,NISSA,Monday
4,NY,2016-07-01 06:30:00,VOLKS,Friday


In [57]:
#Let's see the most frequent days for parking tickets
df_parking["DOW"].value_counts()

Thursday     81
Tuesday      80
Sunday       73
Friday       73
Wednesday    68
Monday       64
Saturday     60
Name: DOW, dtype: int64

In [58]:
#We can even do time arithmetic.  We get a Timedelta object
delta = df_parking.Issue_Date.max() - df_parking.Issue_Date.min()
delta

Timedelta('753 days 09:21:00')

In [60]:
# IPython help: shows the documentation for the Timedelta object
delta?

In [61]:
#We can pull out attributes; .seconds is only the part left over after the whole days
delta.seconds, delta.days

(33660, 753)
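
A common pitfall is to read `.seconds` as the full duration. It is only the remainder after whole days have been removed, while `total_seconds()` covers the whole span. A small sketch with a made-up `Timedelta`:

```python
import pandas as pd

# Made-up duration: 2 days and 3 hours
span = pd.Timedelta(days=2, hours=3)

whole_days = span.days                 # 2
leftover_seconds = span.seconds        # 3 hours = 10800 seconds
all_seconds = span.total_seconds()     # 2 * 86400 + 10800 = 183600.0

# Dividing by another Timedelta expresses the span in that unit
span_in_hours = span / pd.Timedelta(hours=1)  # 51.0

print(whole_days, leftover_seconds, all_seconds, span_in_hours)
```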

In [62]:
# Let's say I wanted to subtract 5 hours from the first parking ticket
firstTicket = df_parking.loc[0,"Issue_Date"]

#We can create a Timedelta object
timeDiff = pd.Timedelta(hours = 5)

print(firstTicket)
print(firstTicket - timeDiff)


2016-06-30 02:17:00
2016-06-29 21:17:00


*(Exercise)*: 
- Create a column that computes the time difference of the issue date relative to the earliest ticket.
- Find the hours of the day when most tickets are given.

In [64]:
df_parking.head()

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make,DOW
0,NY,2016-06-30 02:17:00,TOYOT,Thursday
1,NY,2016-07-04 01:18:00,ME/BE,Monday
2,NY,2016-07-11 06:15:00,LINCO,Monday
3,NY,2016-07-04 01:10:00,NISSA,Monday
4,NY,2016-07-01 06:30:00,VOLKS,Friday


In [73]:
# The earliest ticket is the minimum of the Issue_Date column
firstTicket = df_parking["Issue_Date"].min()
df_parking['difference'] = df_parking['Issue_Date'] - firstTicket
df_parking.sort_values(by="Issue_Date", inplace=False)

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make,DOW,difference
359,NY,2015-05-26 14:54:00,FIAT,Tuesday,0 days 00:00:00
476,NY,2016-03-10 17:30:00,NISSA,Thursday,289 days 02:36:00
477,NY,2016-03-11 19:00:00,FORD,Friday,290 days 04:06:00
478,CT,2016-03-12 12:30:00,BMW,Saturday,290 days 21:36:00
479,NY,2016-03-18 17:56:00,BMW,Friday,297 days 03:02:00
...,...,...,...,...,...
208,NY,2016-07-19 22:55:00,VOLKS,Tuesday,420 days 08:01:00
172,NY,2016-07-20 23:23:00,AUDI,Wednesday,421 days 08:29:00
100,NY,2016-07-31 08:07:00,MACK,Sunday,431 days 17:13:00
314,NY,2017-04-28 03:20:00,,Friday,702 days 12:26:00
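
The new column holds timedeltas, and the `.dt` accessor works on those too. The sketch below uses two hand-typed timestamps (the two earliest values shown above) to turn the differences into plain numbers; the variable names are made up for illustration.

```python
import pandas as pd

# Two hand-typed timestamps in the style of the parking data
stamps = pd.Series(pd.to_datetime(["2015-05-26 14:54", "2016-03-10 17:30"]))

# Subtracting a single Timestamp from a datetime Series gives a timedelta Series
elapsed = stamps - stamps.min()

# .dt on a timedelta Series exposes components and total_seconds()
whole_days = elapsed.dt.days
total_hours = elapsed.dt.total_seconds() / 3600

print(whole_days.tolist())  # [0, 289]
print(total_hours.tolist())
```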


In [76]:
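# Tickets per hour of the day; the index holds the hour, the values hold the counts
# df_parking.Issue_Date.dt.hour.value_counts()
# Largest number of tickets issued in any single hour
df_parking["Issue_Date"].dt.hour.value_counts().max()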

44
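
The count alone does not say which hour it belongs to. One way to recover the hour is to look at the index of the `value_counts()` result, as in this sketch on made-up timestamps.

```python
import pandas as pd

# Made-up timestamps: two tickets in hour 2, one in hour 6
stamps = pd.Series(pd.to_datetime([
    "2016-06-30 02:17", "2016-07-04 02:45", "2016-07-11 06:15",
]))

# Index = hour of day, values = number of tickets in that hour
per_hour = stamps.dt.hour.value_counts()

busiest_hour = per_hour.idxmax()   # hour with the most tickets (first one if tied)
busiest_count = per_hour.max()     # how many tickets fall in that hour

# All hours that reach the maximum, in case of ties
tied_hours = sorted(per_hour[per_hour == busiest_count].index)

print(busiest_hour, busiest_count, tied_hours)  # 2 2 [2]
```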