# Pandas 

We will continue with Pandas and cover the following topics:

- [Datetime in Pandas](#datetime-in-pandas)
- [Change Column Names](#change-column-names)
- [Change Index Names](#change-index-names)
- [Combining Dataframes](#combining-dataframes)
- [Handling Missing Data](#handling-missing-data)


## Datetime in Pandas

Now let's dive into the datetime column type using the Parking dataset, where each row corresponds to a parking ticket issued in NYC.

In [1]:
import pandas as pd

# We will look at another dataset
df_parking = pd.read_csv("Data/Parking.csv")
df_parking.head()

Unnamed: 0.1,Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make
0,0,NY,6/30/16 2:17,TOYOT
1,1,NY,7/4/16 1:18,ME/BE
2,2,NY,7/11/16 6:15,LINCO
3,3,NY,7/4/16 1:10,NISSA
4,4,NY,7/1/16 6:30,VOLKS


I don't like having that Unnamed column. I can fix this by telling pandas that I want that column to be the index instead of a separate column.

In [2]:
#Read in parking and specify column that will serve as index
df_parking = pd.read_csv("Data/Parking.csv", index_col=0)
df_parking.head()

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make
0,NY,6/30/16 2:17,TOYOT
1,NY,7/4/16 1:18,ME/BE
2,NY,7/11/16 6:15,LINCO
3,NY,7/4/16 1:10,NISSA
4,NY,7/1/16 6:30,VOLKS


Now let's have a look at how pandas read in each column. Note that a dataframe always uses the **same** type for a whole column; text columns, like all three here, are stored as `object`.

In [5]:
#look at how each column is stored
df_parking.dtypes

Registration_State    object
Issue_Date            object
Vehicle_Make          object
dtype: object
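
To make the "one type per column" point concrete, here is a minimal sketch with made-up values (the series names are hypothetical and not part of the parking data). It suggests how pandas picks a single dtype when the entries of a column differ.

```python
import pandas as pd

# Hypothetical toy columns, not taken from the parking data
ints = pd.Series([1, 2, 3])                # only integers
mixed_numbers = pd.Series([1, 2.5, 3])     # integers and a float
mixed_objects = pd.Series([1, "NY", 3.0])  # numbers and a string

print(ints.dtype)           # an integer dtype
print(mixed_numbers.dtype)  # float64: the integers are upcast to floats
print(mixed_objects.dtype)  # object: the generic fallback for mixed content
```

An `object` column can hold anything, which is why string columns such as `Issue_Date` show up as `object` until we convert them.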

In [12]:
from datetime import datetime

today = datetime(2024, 3, 9)
# today.month
# today.day

Although we can store dates and times as strings, strings don't support any time-related operations.
For example, how would we extract the month from the string?
We want Issue_Date to be a `datetime` and not a string. Let's convert it.

In [13]:
# Reset the column Issue Date to be a datetime
df_parking["Issue_Date"] = pd.to_datetime(df_parking["Issue_Date"]) 

# Now it's a datetime column
df_parking.dtypes

Registration_State            object
Issue_Date            datetime64[ns]
Vehicle_Make                  object
dtype: object
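
By default `pd.to_datetime` infers the date format. As a hedged sketch, the example below passes the format explicitly and uses `errors="coerce"` so that unparseable entries become `NaT` instead of raising an error. The sample strings imitate the style of `Issue_Date`; the malformed entry is made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same style as Issue_Date, plus one malformed entry
raw = pd.Series(["6/30/16 2:17", "7/4/16 1:18", "not a date"])

# format spells out month/day/two-digit year and hour:minute;
# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%y %H:%M", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that failed to parse
```

Stating the format removes ambiguity (is `7/4/16` July 4th or April 7th?), at the cost of having to know how the file was written.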

In [15]:
df_parking[df_parking['Issue_Date'].dt.month == 5]

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make
217,NY,2016-05-28 04:58:00,CHRYS
218,NY,2016-05-26 19:02:00,HONDA
220,NY,2016-05-17 19:35:00,CHRYS
222,NY,2016-05-18 12:36:00,ROVER
223,CA,2016-05-18 02:04:00,KENWO
226,NY,2016-05-22 05:30:00,FORD
346,NY,2016-05-03 08:50:00,TOYOT
349,NJ,2016-05-25 11:43:00,FORD
359,NY,2015-05-26 14:54:00,FIAT
363,RI,2016-05-19 01:05:00,HONDA


In [17]:
df_parking['Vehicle_Make'].str.lower()

# df_parking['Issue_Date'].dt.day

0      toyot
1      me/be
2      linco
3      nissa
4      volks
       ...  
494    mercu
495    frueh
496    mitsu
497       kw
498    inter
Name: Vehicle_Make, Length: 499, dtype: object

In [3]:
import datetime
x = datetime.datetime(2020, 5, 17)
print(x.day)
print(x.month)
print(x.hour)
print(x.year)
print(str(x.today()))

17
5
0
2020
2023-11-12 17:49:41.188049


Let's look at the first entry of this column, which is now a datetime column.

In [7]:
first_entry = df_parking.loc[0,"Issue_Date"]
first_entry

Timestamp('2016-06-30 02:17:00')

We see that it is a timestamp.  Timestamps have lots of nice attributes that we can extract.

In [8]:
#Get the day
first_entry.day

30

In [58]:
#Get the month
first_entry.month

6

In [59]:
# We can get the weekday name
first_entry.day_name()

'Thursday'

In [60]:
#We can even see if the year is a leap year
first_entry.is_leap_year

True

In [17]:
df_parking["Issue_Date"].dt.day

0      30
1       4
2      11
3       4
4       1
       ..
494    12
495    17
496     7
497     9
498    23
Name: Issue_Date, Length: 499, dtype: int64

If we want to get these attributes for an entire column, we have to go through the `.dt` accessor of the `pandas.Series`, which exposes the datetime properties for every entry at once.

In [62]:
#Get the day of the week for the entire column
all_dow = df_parking["Issue_Date"].dt.day_name()
all_dow.head()

0    Thursday
1      Monday
2      Monday
3      Monday
4      Friday
Name: Issue_Date, dtype: object
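
As a small self-contained sketch of the `.dt` accessor, the example below builds a toy series of timestamps (the variable names are hypothetical) and extracts several properties at once.

```python
import pandas as pd

# Three timestamps in the style of the parking data
stamps = pd.Series(pd.to_datetime(["2016-06-30 02:17", "2016-07-04 01:18", "2016-07-11 06:15"]))

summary = pd.DataFrame({
    "month": stamps.dt.month,                # month number, 1 to 12
    "hour": stamps.dt.hour,                  # hour of the day, 0 to 23
    "weekday": stamps.dt.day_name(),         # day_name is a method, so it needs ()
    "is_weekend": stamps.dt.dayofweek >= 5,  # Monday is 0, so 5 and 6 are the weekend
})
print(summary)
```

Each `.dt` expression returns a series of the same length, so the results can be used directly as new columns or as boolean masks for filtering.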

In [63]:
#Let's add this column in
df_parking["DOW"] = df_parking["Issue_Date"].dt.day_name()
df_parking.head()

Unnamed: 0,Registration_State,Issue_Date,Vehicle_Make,DOW
0,NY,2016-06-30 02:17:00,TOYOT,Thursday
1,NY,2016-07-04 01:18:00,ME/BE,Monday
2,NY,2016-07-11 06:15:00,LINCO,Monday
3,NY,2016-07-04 01:10:00,NISSA,Monday
4,NY,2016-07-01 06:30:00,VOLKS,Friday


In [64]:
#Let's see the most frequent days for parking tickets
df_parking["DOW"].value_counts()

Thursday     81
Tuesday      80
Friday       73
Sunday       73
Wednesday    68
Monday       64
Saturday     60
Name: DOW, dtype: int64

In [18]:
#We can even do time arithmetic.  We get a Timedelta object
latest = df_parking.Issue_Date.max()
print(latest)
earliest = df_parking.Issue_Date.min()
print(earliest)
delta = latest - earliest
delta

2017-06-18 00:15:00
2015-05-26 14:54:00


Timedelta('753 days 09:21:00')

In [66]:
delta?

Type:        Timedelta
String form: 753 days 09:21:00
File:        c:\users\admin-chenni14\anaconda3\lib\site-packages\pandas\_libs\tslibs\timedeltas.cp38-win_amd64.pyd
Docstring:  
Represents a duration, the difference between two dates or times.

Timedelta is the pandas equivalent of python's ``datetime.timedelta``
and is interchangeable with it in most cases.

Parameters
----------
value : Timedelta, timedelta, np.timedelta64, str, or int
unit : str, default 'ns'
    Denote the unit of the input, if input is an integer.

    Possible values:

    * 'W', 'D', 'T', 'S', 'L', 'U', or 'N'
    * 'days' or 'day'
    * 'hours', 'hour', 'hr', or 'h'
    * 'minutes', 'minute', 'min', or 'm'
    * 'seconds', 'second', or 'sec'
    * 'milliseconds', 'millisecond', 'millis', or 'milli'
    * 'microseconds', 'microsecond', 'micros', or 'micro'
    * 'nanoseconds', 'nanosecond', 'nanos', 'nano', or 'ns'.

**kwargs
    Available kwargs: {days, seconds, m

In [67]:
#We can pull out attributes
delta.seconds, delta.days

(33660, 753)
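
One point that is easy to miss: `.seconds` is only the leftover part within the last day, not the whole duration. The sketch below rebuilds the same duration by hand (the name `delta_demo` is hypothetical) and compares the attributes with `total_seconds()`.

```python
import pandas as pd

# The same duration as above, built directly: 753 days, 9 hours, 21 minutes
delta_demo = pd.Timedelta(days=753, hours=9, minutes=21)

print(delta_demo.days)                     # whole days
print(delta_demo.seconds)                  # leftover seconds within the last day
print(delta_demo.total_seconds())          # the entire duration in seconds
print(delta_demo / pd.Timedelta(hours=1))  # the entire duration in hours
```

Dividing by another `Timedelta` is a convenient way to express a duration in whatever unit you need.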

In [68]:
# Let's say I wanted to subtract 5 hours from the first parking ticket
firstTicket = df_parking.loc[0,"Issue_Date"]

#We can create a Timedelta object
timeDiff = pd.Timedelta(hours = 5)

print(firstTicket)
print(firstTicket - timeDiff)


2016-06-30 02:17:00
2016-06-29 21:17:00


*(Exercise)*: Create a column that computes the time difference of the issue date relative to the earliest ticket.

## Change Column Names

In this part, we will go back to the grade dataset and learn a number of other tricks and concepts.

- Changing column names
- Combining dataframes
- Understanding the index
- Missing Data

In [18]:
import pandas as pd

#Read in the data frame
df=pd.read_csv("Data/Grades.csv", header=0)

df.head()

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade
0,Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A
1,Joe,32.0,1,20.0,16.0,1,14.0,32.0,A
2,Susan,30.0,1,19.0,19.0,1,10.5,33.0,A-
3,Otto,31.0,1,22.0,13.0,1,13.0,34.0,A
4,Chris,30.0,1,19.0,17.0,1,12.5,33.5,A


Recall that we can get the column names through the `columns` attribute.

In [19]:
#Get the column names
df.columns

Index(['Name', 'Previous_Part', 'Participation1', 'Mini_Exam1', 'Mini_Exam2',
       'Participation2', 'Mini_Exam3', 'Final', 'Grade'],
      dtype='object')

We can change column names through the `rename` method.

In [20]:
#Change the column names
df.rename(columns={"Participation1":"Participation_1", "Participation2":"Participation_2"}, inplace=True)

df.head()

Unnamed: 0,Name,Previous_Part,Participation_1,Mini_Exam1,Mini_Exam2,Participation_2,Mini_Exam3,Final,Grade
0,Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A
1,Joe,32.0,1,20.0,16.0,1,14.0,32.0,A
2,Susan,30.0,1,19.0,19.0,1,10.5,33.0,A-
3,Otto,31.0,1,22.0,13.0,1,13.0,34.0,A
4,Chris,30.0,1,19.0,17.0,1,12.5,33.5,A


The format for the `columns` input is `{"old_column_name": "new_column_name"}`. The `rename` method can also relabel the index: pass the same kind of dictionary to `index` instead of `columns`.
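
Here is a minimal sketch of renaming both columns and index labels in one call. The small dataframe and the new labels are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row excerpt in the style of the grades data
grades_demo = pd.DataFrame({"Name": ["Joe", "Susan"], "Final": [32.0, 33.0]})

# Without inplace=True, rename returns a new dataframe and leaves the original alone
renamed = grades_demo.rename(
    columns={"Final": "Final_Exam"},
    index={0: "first", 1: "second"},
)

print(renamed)
print(grades_demo.columns.tolist())  # the original column names are unchanged
```

Labels that do not appear in the dictionary are left as they are.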

## Change Index Names

Currently, for the dataframe df, the index is just the row numbers. 
But sometimes it is more convenient to index the rows by, say, the names.

In [21]:
#Setting the column Name to be the index
df.set_index("Name", inplace = True)

df.head()

Name,Previous_Part,Participation_1,Mini_Exam1,Mini_Exam2,Participation_2,Mini_Exam3,Final,Grade
Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A
Joe,32.0,1,20.0,16.0,1,14.0,32.0,A
Susan,30.0,1,19.0,19.0,1,10.5,33.0,A-
Otto,31.0,1,22.0,13.0,1,13.0,34.0,A
Chris,30.0,1,19.0,17.0,1,12.5,33.5,A


The `inplace = True` argument modifies the dataframe `df` directly instead of returning a new one.  
Can you guess what happens if we leave it out?
Now the Name column is our index and we can access rows by name:

In [22]:
#Access Joe's info
df.loc["Joe",:]

Previous_Part      32.0
Participation_1       1
Mini_Exam1         20.0
Mini_Exam2         16.0
Participation_2       1
Mini_Exam3         14.0
Final              32.0
Grade                 A
Name: Joe, dtype: object

When setting the index, make sure you choose a column that will uniquely identify each row.
(What would be the problem if it is not the case?)

We can change the index back to row numbers using the reset_index() method.

In [27]:
#Resetting the index
df.reset_index(drop=False, inplace=True)
df.head()

Unnamed: 0,Name,Previous_Part,Participation_1,Mini_Exam1,Mini_Exam2,Participation_2,Mini_Exam3,Final,Grade
0,Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A
1,Joe,32.0,1,20.0,16.0,1,14.0,32.0,A
2,Susan,30.0,1,19.0,19.0,1,10.5,33.0,A-
3,Otto,31.0,1,22.0,13.0,1,13.0,34.0,A
4,Chris,30.0,1,19.0,17.0,1,12.5,33.5,A


Now we are back to the original data frame. Setting `drop = False` (default) adds the old index as a new column in the dataframe instead of just deleting it.
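
To see the effect of `drop` side by side, here is a small sketch on a made-up dataframe (all names are hypothetical).

```python
import pandas as pd

# Hypothetical dataframe indexed by name
demo = pd.DataFrame({"Name": ["Joe", "Susan"], "Final": [32.0, 33.0]}).set_index("Name")

kept = demo.reset_index(drop=False)    # old index comes back as a regular column
dropped = demo.reset_index(drop=True)  # old index is discarded entirely

print(kept.columns.tolist())     # includes "Name"
print(dropped.columns.tolist())  # "Name" is gone
```

In both cases the new index is the default row numbers starting at 0.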

*(Exercise)*: Change the index to `Participation_1`. What happens?


## Combining Dataframes

Next, we see how to combine or concatenate two (or more) data frames vertically.

In [28]:
#I can combine data frames with concat function
head = df.head()
tail = df.tail()

In [29]:
#Have a look at the variable head
head

Unnamed: 0,Name,Previous_Part,Participation_1,Mini_Exam1,Mini_Exam2,Participation_2,Mini_Exam3,Final,Grade
0,Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A
1,Joe,32.0,1,20.0,16.0,1,14.0,32.0,A
2,Susan,30.0,1,19.0,19.0,1,10.5,33.0,A-
3,Otto,31.0,1,22.0,13.0,1,13.0,34.0,A
4,Chris,30.0,1,19.0,17.0,1,12.5,33.5,A


In [30]:
#Have a look at the variable tail
tail

Unnamed: 0,Name,Previous_Part,Participation_1,Mini_Exam1,Mini_Exam2,Participation_2,Mini_Exam3,Final,Grade
14,Chrinstine,29.0,1,13.0,15.5,1,9.0,31.0,B
15,Josh,23.5,1,17.0,12.0,1,8.5,23.0,C+
16,Jackson,28.0,1,18.0,15.5,1,7.0,31.0,B
17,Vik,31.5,1,15.0,19.0,1,13.0,35.0,A
18,Sarah,22.0,1,18.0,13.0,1,9.0,21.0,C+


In [36]:
#axis=0 says stack them top to bottom. axis =1 stacks side to side 
dfConcat = pd.concat([head,tail], axis=0)
dfConcat

Unnamed: 0,Name,Previous_Part,Participation_1,Mini_Exam1,Mini_Exam2,Participation_2,Mini_Exam3,Final,Grade
0,Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A
1,Joe,32.0,1,20.0,16.0,1,14.0,32.0,A
2,Susan,30.0,1,19.0,19.0,1,10.5,33.0,A-
3,Otto,31.0,1,22.0,13.0,1,13.0,34.0,A
4,Chris,30.0,1,19.0,17.0,1,12.5,33.5,A
14,Chrinstine,29.0,1,13.0,15.5,1,9.0,31.0,B
15,Josh,23.5,1,17.0,12.0,1,8.5,23.0,C+
16,Jackson,28.0,1,18.0,15.5,1,7.0,31.0,B
17,Vik,31.5,1,15.0,19.0,1,13.0,35.0,A
18,Sarah,22.0,1,18.0,13.0,1,9.0,21.0,C+


So the `concat` function takes a list of dataframes as the first input and an `axis` input for whether you want to stack top to bottom or side to side. Note that after we stack, the index keeps the original row labels (0 to 4, then 14 to 18) instead of running consecutively. Let's use the `reset_index` method to change the index back to row numbers.


In [37]:
dfConcat.reset_index(inplace= True, drop=True)
dfConcat

Unnamed: 0,Name,Previous_Part,Participation_1,Mini_Exam1,Mini_Exam2,Participation_2,Mini_Exam3,Final,Grade
0,Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A
1,Joe,32.0,1,20.0,16.0,1,14.0,32.0,A
2,Susan,30.0,1,19.0,19.0,1,10.5,33.0,A-
3,Otto,31.0,1,22.0,13.0,1,13.0,34.0,A
4,Chris,30.0,1,19.0,17.0,1,12.5,33.5,A
5,Chrinstine,29.0,1,13.0,15.5,1,9.0,31.0,B
6,Josh,23.5,1,17.0,12.0,1,8.5,23.0,C+
7,Jackson,28.0,1,18.0,15.5,1,7.0,31.0,B
8,Vik,31.5,1,15.0,19.0,1,13.0,35.0,A
9,Sarah,22.0,1,18.0,13.0,1,9.0,21.0,C+
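
As an alternative sketch, `pd.concat` accepts `ignore_index=True`, which renumbers the rows while stacking so that no separate `reset_index` call is needed. The two small dataframes below are made up to mimic `head` and `tail`.

```python
import pandas as pd

# Hypothetical stand-ins for head and tail, with non-consecutive row labels
top = pd.DataFrame({"Name": ["Ningyuan", "Joe"], "Final": [33.0, 32.0]}, index=[0, 1])
bottom = pd.DataFrame({"Name": ["Vik", "Sarah"], "Final": [35.0, 21.0]}, index=[17, 18])

# ignore_index=True discards the old labels and numbers the rows 0, 1, 2, ...
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
print(stacked)
```

Use this only when the old row labels carry no information you want to keep.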


## Handling Missing Data

Missing data is common in most data analysis applications.  You have a number of options for filtering out missing data: you can do it by hand, or you can use the *dropna* method.

With dataframe objects, things get a little more complex.  You may want to drop rows or columns that are all NA or just those containing any NAs. By default, *dropna* drops any row containing a missing value.

In [39]:
#Here we have two pieces of missing data
df_missing = pd.read_csv("Data/Missing_Data.csv")
df_missing

# NaN -> Not a Number

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1,19.5,20,1,10.0,33.0,A,-1
1,Joe,,1,20.0,16,1,14.0,32.0,A,23
2,Otto,31.0,1,,13,1,13.0,34.0,A,34
3,Chris,30.0,-1,19.0,not available,1,12.5,33.5,A,72


The isnull() method returns a series or dataframe of booleans corresponding to whether the particular entries are null or not.

In [40]:
#isnull method for a data frame
df_missing.isnull()

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
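
A boolean table like this is hard to read for larger data. As a minimal sketch on made-up values, the example below counts missing entries per column and selects the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with two missing entries
demo = pd.DataFrame({
    "Previous_Part": [32.0, np.nan, 31.0],
    "Mini_Exam1": [19.5, 20.0, np.nan],
    "Final": [33.0, 32.0, 34.0],
})

# True counts as 1 when summed, so this gives the number of NAs per column
missing_per_column = demo.isnull().sum()

# any(axis=1) is True for rows with at least one NA
rows_with_missing = demo[demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```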


Notice that the `not available` entry in Mini_Exam2 was not flagged as missing. We can make sure such entries are all read in as NA values by using the `na_values` input when we read in the file.

In [41]:
#Notice that here the not available is turned into an NaN value
df_missing_NA = pd.read_csv("Data/Missing_Data.csv", na_values=["NaN", "not available"])
df_missing_NA

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,,1,20.0,16.0,1,14.0,32.0,A,23
2,Otto,31.0,1,,13.0,1,13.0,34.0,A,34
3,Chris,30.0,-1,19.0,,1,12.5,33.5,A,72


In [42]:
#Let's rerun the isnull() method on the whole dataframe
df_missing_NA.isnull()

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False


Let's say we now realize that the -1 in the Participation1 column is an NA value.  If we add -1 to the na_values input, we will also replace the -1 in the Temp column. Luckily, we can give a dictionary to the na_values input which specifies the NA values for each column.

In [43]:
#Note that the temp column is unaffected
df_missing_NA2 = pd.read_csv("Data/Missing_Data.csv",
                na_values={"Previous_Part":"NA", "Participation1":-1,"Mini_Exam2":"not available"})
df_missing_NA2

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1.0,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,,1.0,20.0,16.0,1,14.0,32.0,A,23
2,Otto,31.0,1.0,,13.0,1,13.0,34.0,A,34
3,Chris,30.0,,19.0,,1,12.5,33.5,A,72
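
The per-column behaviour can be tried without the data file. The sketch below reads a small made-up CSV from a string (the contents are hypothetical and only imitate `Missing_Data.csv`) and checks that -1 is treated as missing in one column but kept in another.

```python
import io
import pandas as pd

# Hypothetical CSV text imitating a few columns of the missing-data file
csv_text = """Name,Participation1,Mini_Exam2,Temp
Ningyuan,1,20,-1
Chris,-1,not available,72
"""

# Each key is a column name; each value lists what counts as NA in that column only
df_demo = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"Participation1": [-1], "Mini_Exam2": ["not available"]},
)

print(df_demo)
print(df_demo.isnull().sum())  # Temp has no missing values even though it contains -1
```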


In [44]:
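#Fill the missing Previous_Part value with the column mean (assigning back avoids a chained inplace call)
df_missing_NA2['Previous_Part'] = df_missing_NA2['Previous_Part'].fillna(df_missing_NA2['Previous_Part'].mean())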
df_missing_NA2

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1.0,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,31.0,1.0,20.0,16.0,1,14.0,32.0,A,23
2,Otto,31.0,1.0,,13.0,1,13.0,34.0,A,34
3,Chris,30.0,,19.0,,1,12.5,33.5,A,72


Now let's see how we can drop or replace these NA values.

In [49]:
#Get rid of all rows with an NA
# df_missing_NA2.dropna(axis=1)
df_missing_NA2.dropna() # axis=0

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1.0,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,31.0,1.0,20.0,16.0,1,14.0,32.0,A,23


*(Question):* What does `axis=0` do? What happens if we use a different value?

In [50]:
#Passing how='all' will only drop rows that are all NA (doesn't change anything)
df_missing_NA2.dropna(how='all') # any

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1.0,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,31.0,1.0,20.0,16.0,1,14.0,32.0,A,23
2,Otto,31.0,1.0,,13.0,1,13.0,34.0,A,34
3,Chris,30.0,,19.0,,1,12.5,33.5,A,72


In [11]:
#Dropping columns is just a matter of passing axis=1: every column containing an NA is removed
df_missing_NA2.dropna(axis=1,how='any')

Unnamed: 0,Name,Mini_Exam1,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,19.5,1,10.0,33.0,A,-1
1,Joe,20.0,1,14.0,32.0,A,23
2,Otto,22.0,1,13.0,34.0,A,34
3,Chris,19.0,1,12.5,33.5,A,72
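
Besides `axis` and `how`, `dropna` also accepts `subset` (only look at certain columns) and `thresh` (keep rows with at least that many non-NA values). Here is a hedged sketch on made-up values in the style of the grades data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Joe and Otto miss one value each, Chris misses two
demo = pd.DataFrame({
    "Name": ["Ningyuan", "Joe", "Otto", "Chris"],
    "Previous_Part": [32.0, np.nan, 31.0, 30.0],
    "Mini_Exam1": [19.5, 20.0, np.nan, np.nan],
    "Mini_Exam2": [20.0, 16.0, 13.0, np.nan],
})

# Drop a row only if Previous_Part is missing; NAs elsewhere are tolerated
only_prev = demo.dropna(subset=["Previous_Part"])

# Keep rows that have at least 3 non-NA values out of the 4 columns
at_least_3 = demo.dropna(thresh=3)

print(only_prev["Name"].tolist())
print(at_least_3["Name"].tolist())
```

Both calls return a new dataframe and leave `demo` untouched.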


Rather than filtering out missing data, you may want to fill in the "holes" in any number of ways. For most purposes, the *fillna* method with a constant replaces missing values with that value.

In [51]:
df_missing_NA2.fillna(0)

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1.0,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,31.0,1.0,20.0,16.0,1,14.0,32.0,A,23
2,Otto,31.0,1.0,0.0,13.0,1,13.0,34.0,A,34
3,Chris,30.0,0.0,19.0,0.0,1,12.5,33.5,A,72


In [52]:
#You can pass fillna a dict which gives the replacement value for each column
df_missing_NA2.fillna({"Previous_Part":5,"Mini_Exam2":0.5})

Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1.0,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,31.0,1.0,20.0,16.0,1,14.0,32.0,A,23
2,Otto,31.0,1.0,,13.0,1,13.0,34.0,A,34
3,Chris,30.0,,19.0,0.5,1,12.5,33.5,A,72


With *fillna* you can do lots of things with a little creativity.  For example, you might pass the mean or median value of a series.


In [90]:
#Replace with mean (numeric_only=True skips text columns such as Name and Grade)
df_missing_NA2.fillna(df_missing_NA2.mean(numeric_only=True))


Unnamed: 0,Name,Previous_Part,Participation1,Mini_Exam1,Mini_Exam2,Participation2,Mini_Exam3,Final,Grade,Temp
0,Ningyuan,32.0,1.0,19.5,20.0,1,10.0,33.0,A,-1
1,Joe,31.0,1.0,20.0,16.0,1,14.0,32.0,A,23
2,Otto,31.0,1.0,22.0,13.0,1,13.0,34.0,A,34
3,Chris,30.0,1.0,19.0,16.333333,1,12.5,33.5,A,72
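
The same pattern works with the median, which is less sensitive to outliers than the mean. This is a minimal sketch on made-up values; `numeric_only=True` is passed so that text columns are skipped when the medians are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt with one missing exam score
demo = pd.DataFrame({
    "Name": ["Ningyuan", "Joe", "Otto", "Chris"],
    "Mini_Exam2": [20.0, 16.0, 13.0, np.nan],
    "Grade": ["A", "A", "A", "A"],
})

# median() returns one value per numeric column; fillna matches them up by column name
filled = demo.fillna(demo.median(numeric_only=True))
print(filled)
```

Columns that do not appear in the series of medians, such as `Name` and `Grade`, are left unchanged.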
