# Welcome to the Dark Art of Coding:
## Introduction to Python
Datetimes in pandas

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

In [1]:
# Let's start by importing the pertinent libraries and functions

import pandas as pd
from pandas import DataFrame, Series
import numpy as np

# Date objects

Before we go into anything `pandas`-specific, let's talk a little bit about the built-in library `datetime`. With it we can create datetime objects which `pandas` can use for some really cool things. Keep in mind this is a `datetime` **object**, NOT a **string** that looks like a date. These `datetime` objects have attributes and behaviors we can examine and call.

In [2]:
from datetime import datetime
from datetime import timedelta
date1 = datetime(2016, 5, 3)

In [None]:
# The printed display looks like a string
print('str display:', date1)

# But the item is an actual object 
print('object type: ', type(date1))

# The object itself:
date1

Looking at our datetime object tells us that it can potentially store more than just the year, month, and day. We see two zeroes we didn't define, so there must be something that can go there. Let's turn to the help documentation to see if we can figure this out.

In [None]:
datetime?

Thus we learn that the datetime object can also hold data representing:

* hours
* minutes
* seconds
* microseconds
* time zones

Let's take a look at the difference between two datetime objects. We'll make two dates on the same day, one with the hour value set to **12** and one with the hour value set to **11**, and see the time difference between the two:

In [None]:
date2 = datetime(2016, 5, 3, 12)
date3 = datetime(2016, 5, 3, 11)

difference = date2 - date3

In [None]:
# Let's take a look at this difference object
print('str display:', difference)

print('object type: ', type(difference))

In [None]:
# But let's look at the object itself
difference

Most of us were probably expecting something like "1 hour" but instead we got "3600 something". Let's try looking at the help for a timedelta and see what we can find.

In [None]:
timedelta?

Well that wasn't what we really expected either. Let's try using IPython's verbose help (You can get to it with the double question marks: **??**)

In [None]:
timedelta??

The verbose help shows us the source code AND, in this case, some additional details the author included... such as a snarky comment on what we just saw.

A `timedelta` only stores three categories of time internally:

* Days
* Seconds
* Microseconds

Any other category (weeks, hours, minutes, milliseconds) gets converted into these three. That is why our one hour difference got converted to 3600 seconds.
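
To make that conversion concrete, here is a minimal sketch using only the standard library. The variable names (`gap`, `one_hour`, `long_gap`) are made up for illustration.

```python
from datetime import datetime, timedelta

# Build the same one-hour gap two ways.
gap = datetime(2016, 5, 3, 12) - datetime(2016, 5, 3, 11)
one_hour = timedelta(hours=1)

# Both are stored as days=0, seconds=3600, microseconds=0.
print(gap == one_hour)
print(gap.days, gap.seconds, gap.microseconds)

# For longer spans, .seconds is only the leftover within the day,
# while total_seconds() covers the whole span.
long_gap = timedelta(days=1, hours=2)
print(long_gap.seconds)
print(long_gap.total_seconds() / 3600)
```

If you want "hours" out of a `timedelta`, dividing `total_seconds()` by 3600 is usually safer than dividing `.seconds`, since `.seconds` ignores the days.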

Because these are objects, we can use **dot notation** to access the individual time categories as attributes of the datetime and timedelta objects.

In [None]:
print(difference.days)
print(difference.seconds)

# Want to see the other attributes and methods?
# difference.<tab>

In [None]:
# similarly, we can see the attributes for one of our
# previous date objects:
print(date3.hour)
print(date3.day)
print(date3.month)
print(date3.year)

# Want to see the other attributes and methods?
# date3.<tab>

datetime objects have:

* a default string representation
* an ISO representation

In [None]:
# default string format:
print(str(date3))

# ISO format:
print(date3.isoformat())

In addition, the `datetime` module has the ability to both read in values and write out values using **user defined** formats.

`*.strftime()`

`*.strptime()`

# string formatting:
To format the `str` output of a `datetime` object to suit your needs, you can use `strftime()` and can mix and match from the formatting specifications

In [None]:
datef = datetime(2009, 9, 9)
datef.strftime('%Y-%m-%d')

We are not gonna go into the full collection of formatting specifications.

That is left as an exercise for the student:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
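
As a small taste of mixing and matching, here is a sketch that sticks to numeric formatting codes (so the output should not depend on your locale). The example date and variable names are made up.

```python
from datetime import datetime

stamp = datetime(2009, 9, 9, 14, 5)

iso_date = stamp.strftime('%Y-%m-%d')         # year-month-day
day_first = stamp.strftime('%d/%m/%Y %H:%M')  # day/month/year plus 24-hour time
day_of_year = stamp.strftime('%Y-%j')         # year plus zero-padded day of the year

print(iso_date)
print(day_first)
print(day_of_year)
```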

### WARNING:

The supported formatting codes vary across platforms (linux, unix, windows, mac, etc) because Python relies upon the underlying C library's strftime() function.

The Python Docs list the formatting codes required by the 1989 version of the C standard, which **should be consistent across all platforms**.

But some C libraries may implement **additional** formatting codes.

hattip to Will McCutchen's site: http://strftime.org/

# string parsing:
Sometimes we get strings that contain dates but have unusual formatting. You can parse the string manually and convert it to a datetime object with the `strptime()` function:

In [None]:
# Presume that we have the following string that is a month, day and year separated
# by '|' symbols.
datep = '8|8|2008'
datetime.strptime(datep, '%m|%d|%Y')

# In deference to time, we are gonna leave a deep dive into the
# formatting specifications for the student.
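
One thing worth seeing once: `strptime()` and `strftime()` share the same formatting codes, but a round trip does not always hand back the identical string. This sketch reuses the '|' separated format from above; the variable names are made up.

```python
from datetime import datetime

raw = '8|8|2008'
fmt = '%m|%d|%Y'

parsed = datetime.strptime(raw, fmt)
print(parsed)

# strftime() zero-pads the month and day, so the text differs from the input.
round_trip = parsed.strftime(fmt)
print(round_trip)

# A string that does not match the format raises ValueError.
try:
    datetime.strptime('2008-08-08', fmt)
except ValueError as err:
    print('could not parse:', err)
```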

While manually setting the formatting works, for heavy duty datetime parsing, the automagic parsing available via the dateutil module is hard to beat. 

In [3]:
from dateutil.parser import parse

d1 = '2000-01-01'
d2 = 'December 12, 2001 13:13'
d3 = '23rd January 2002 21:21:21'

for dateobject in d1, d2, d3:
    print(dateobject.ljust(35), parse(dateobject))


2000-01-01                          2000-01-01 00:00:00
December 12, 2001 13:13             2001-12-12 13:13:00
23rd January 2002 21:21:21          2002-01-23 21:21:21


In [9]:
parse('12 OCT 2001 13:13')

datetime.datetime(2001, 10, 12, 13, 13)
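
The automagic has to guess when a date is ambiguous. A quick sketch of how `parse()` treats an all-numeric date, and how the `dayfirst` argument changes the guess (the sample string is made up):

```python
from dateutil.parser import parse

ambiguous = '01/02/2003'

month_first = parse(ambiguous)                # default reading: month/day/year
day_first = parse(ambiguous, dayfirst=True)   # alternative reading: day/month/year

print(month_first.date())
print(day_first.date())
```

If you know where your data comes from, stating the order explicitly is safer than trusting the default.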

# Experience Points!
---

Using the `datetime` module:

* create a `datetime` of the day of your birth
* create a `datetime` of today **including** the current time
* calculate the difference between them

Once you've done that 

* make another `datetime` of your most RECENT birthday
* calculate the difference between it and today
* calculate how many hours are represented by the seconds (hint: divide the number of seconds by 3600)

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own-pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

# `pandas`
As much as it may seem like it, this talk is NOT intended to cover just the `datetime` or `dateutil` modules... but we need to cover them to give us perspective on what `pandas` can do.

Let's start by reading in a csv. We:
* read in the csv
* provide a list of column names
* identify a column for use as an index
* tell `pandas` to automagically parse the strings into dates
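
If you don't have the log file handy, the four steps above can be sketched against a tiny in-memory CSV. The rows, the shortened column list, and the `sample` name are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless log file (made-up rows).
csv_text = (
    "alice,alice@example.com,2016-02-06T21:47:02,41.5\n"
    "bob,bob@example.com,2016-02-07T08:15:30,38.9\n"
)

sample = pd.read_csv(io.StringIO(csv_text),
                     names=['name', 'email', 'datetime', 'lat'],  # column names
                     index_col='datetime',                        # index column
                     parse_dates=True)                            # parse the index as dates

print(type(sample.index))
print(sample)
```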

In [None]:
df = pd.read_csv('log_file.csv', names=['name', 'email',
                                        'fmip', 'toip',
                                        'datetime', 'lat',
                                        'long', 'payload'],
                                 index_col='datetime',
                                 parse_dates=True)

df

From here, let's assign a label to the data held in the name column to simplify the task of referencing it.

In [None]:
names = df['name']
names

`names` is a `Series` and like any `Series`, it has an **`index`**. We can assign that index to its own label as a separate entity.

In [None]:
ts = names.index
ts

Just like any index... we can select for individual entities from the index using indexing and slicing.

In [None]:
ts[5]

Within pandas, you can index using the integer count (as we saw above) *OR* using a string representation of a specific timestamp.

One of the powerful aspects of selection based on index is that you can select items from the `index`, from a `Series`, or from an entire `DataFrame`

In [None]:
time = '2016-02-06 21:47:02'
names[time]

Matching against only part of a timestamp is possible as well: a less specific date string selects every entry that falls within it. Here, let's match against any item in the `Series` whose index falls on '2016-02-06'.

In [None]:
names['2016-02-06']
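
Here is a self-contained sketch of the same idea on made-up data, so you can see how the amount of detail in the string changes what comes back. It uses `.loc`, which does the same label-based lookup.

```python
import pandas as pd

# Made-up data: three timestamps across two days.
idx = pd.to_datetime(['2016-02-06 21:47:02',
                      '2016-02-06 23:10:00',
                      '2016-02-07 01:00:00'])
people = pd.Series(['ann', 'bob', 'cy'], index=idx)

exact = people.loc['2016-02-06 21:47:02']   # full timestamp: a single value
one_day = people.loc['2016-02-06']          # day only: every entry on that day
one_month = people.loc['2016-02']           # month only: every entry in that month

print(exact)
print(one_day)
print(len(one_month))
```

On a `DataFrame`, recent pandas versions expect `.loc` for this kind of row selection (for example `df.loc['2016-02-06']`), since plain brackets look up columns.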

For our next trick, let's use a new DataFrame and Series from a longer dataset (~1000 records).

In [None]:
df_1000 = pd.read_csv('log_file_1000.csv', names=['name', 'email',
                                                  'fmip', 'toip',
                                                  'datetime', 'lat',
                                                  'long', 'payload'],
                                                 index_col='datetime',
                                                 parse_dates=True)

names_long = df_1000['name']

NOTE: many of the most common ways of writing a 'date and time' will work for selecting dates, even if they are NOT a letter-for-letter match of the strings in the file.

This first item will bring back 740+ records that have 2015 in the year.

In [None]:
nl = names_long['2015']
nl

In [None]:
names_long['September 2015']      # <--- yields ~ 181 results

In [None]:
names_long['31st Oct, 2015']       #  <--- yields ~ 9 results

If the datetime information in the DataFrame/Series is in chronological order, you can use slice syntax. Both endpoints of the slice are included.

In [None]:
names_long['Oct, 29 2015':'Oct, 31 2015']         # <--- yields ~ 28 results
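
If your own data is not in chronological order, sorting the index first is one way to get there. A minimal sketch on made-up data (the `visits` name and values are hypothetical):

```python
import pandas as pd

# Made-up data, deliberately out of order.
idx = pd.to_datetime(['2015-10-31 09:00', '2015-10-28 12:00',
                      '2015-10-30 18:30', '2015-10-29 07:45'])
visits = pd.Series([4, 1, 3, 2], index=idx)

ordered = visits.sort_index()                     # put the index in chronological order
window = ordered.loc['2015-10-29':'2015-10-31']   # all of the 29th through all of the 31st

print(window)
```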

Next, we are gonna ingest a dataset, but we will apply a function to truncate the datetime string to only represent the date.

We define a simple function to split the datetime strings into dates and times, and retain only the dates. `pandas` allows us to apply the function when we read in the data using the 'converters' argument of the `read_csv()` function.

In [None]:
def date_split(dt):
    return dt.split('T')[0]

In [None]:
df2_long = pd.read_csv('log_file_1000.csv', names=['name', 'email',
                                                   'fmip', 'toip', 
                                                   'datetime', 'lat',
                                                   'long', 'payload'],
                      index_col='datetime',
                      converters={'datetime':date_split},
                      parse_dates=True)

df2_long[['name', 'lat', 'long', 'payload']]

With the new dataset, we can see many dates that duplicate or repeat, which means that we have the capability to group by those dates.

In [None]:
grouped = df2_long['name'].groupby(level=0)  # group by the zeroth
                                             # level of the index
                                             # hierarchy

In [None]:
# Now, let's quickly check the size of the groups...

grouped.size()
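
As an aside, a converter is not the only way to end up with date-only index values. This sketch swaps in a different technique: it parses the full timestamps first and then calls `normalize()` on the index, which is my substitution rather than what the cells above do. The data is made up.

```python
import pandas as pd

# Made-up data: two events on one day, one on the next.
idx = pd.to_datetime(['2016-02-06T21:47:02',
                      '2016-02-06T23:10:00',
                      '2016-02-07T01:00:00'])
events = pd.Series(['ann', 'bob', 'cy'], index=idx)

# normalize() resets every timestamp to midnight, leaving only the date part.
events.index = events.index.normalize()

# Repeated dates can now be grouped, just like before.
per_day = events.groupby(level=0).size()
print(per_day)
```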

# Experience Points
---

* Read in the same data set (`log_file_1000.csv`) as a `pandas DataFrame`
* Create separate `DataFrame` that only includes rows between the
    * 29th of December 2015
    * 5th of January 2016
    
* Confirm that your `DataFrame` only contains the dates that you want


When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own-pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
df2_long = pd.read_csv('log_file_1000.csv', names=['name', 'email',
                                                   'fmip', 'toip', 
                                                   'datetime', 'lat',
                                                   'long', 'payload'],
                      index_col='datetime',
                      converters={'datetime':date_split},
                      parse_dates=True)

In [None]:
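# Occasionally, our time series doesn't have all the periods
# that we might want... resampling can solve that problem.
# Similarly, sometimes our time series might have too many samples.
# We stick to the numeric columns here, since a mean of the
# text columns (name, email, ...) is not meaningful.

df2_long[['lat', 'long']].resample('4h').mean()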

In [None]:
# Older versions of pandas applied the mean by default when
# resampling. Current versions expect us to name the aggregation
# explicitly, which also lets us pick other ways of handling it.


df2_long[['lat', 'long']].resample('2D').sum()

# try:
# .var()
# .std()
# etc.
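
To see both directions of resampling side by side, here is a small sketch on made-up daily data: going to fewer, wider bins (downsampling) and to more, narrower bins (upsampling).

```python
import pandas as pd

daily = pd.Series([1, 2, 3, 4],
                  index=pd.date_range('2016-01-01', periods=4, freq='D'))

# Downsample: each 2-day bin is summed.
two_day = daily.resample('2D').sum()

# Upsample: the new 12-hour slots start out empty (NaN)...
half_day = daily.resample('12h').asfreq()
# ...unless we choose a fill strategy such as forward fill.
half_day_filled = daily.resample('12h').ffill()

print(two_day)
print(half_day)
print(half_day_filled)
```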


In [None]:


# Frequencies are defined as a base frequency and a multiplier...
#   2M - every two months (month end)
#   2h30min - every two hours, 30 mins
#   D - daily
#   B - business daily
#   BM - end of the business month
#   W-MON - weekly on a given day of the week
#   WOM-1TUE - week of the month (1st, 2nd, etc) and day of week
#   QS-JAN - start of the quarter
#
# NOTE: recent pandas releases (2.2 onward, as far as I know) spell the
# month-end aliases ME and BME, and warn about the older M and BM.
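
A quick way to get a feel for these frequency strings is to hand them to `pd.date_range()` and look at what comes out. This sketch tries a few of the aliases listed above, starting from an arbitrary date that I picked because it falls on a Friday.

```python
import pandas as pd

start = '2016-01-01'   # a Friday

business_days = pd.date_range(start, periods=3, freq='B')
every_150_min = pd.date_range(start, periods=3, freq='2h30min')
mondays = pd.date_range(start, periods=2, freq='W-MON')
first_tuesdays = pd.date_range(start, periods=2, freq='WOM-1TUE')

for label, rng in [('B', business_days), ('2h30min', every_150_min),
                   ('W-MON', mondays), ('WOM-1TUE', first_tuesdays)]:
    print(label.ljust(10), list(rng.strftime('%Y-%m-%d %H:%M')))
```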

Often, knowing how to handle time zones is important... especially in ensuring accuracy across time zones. Let's take just the first 10 lines of the column `name` in the long file. At this moment, there is no explicit time zone associated with these timestamps. This is referred to as being time zone naive.

In [None]:
times = df_1000.name[:10]
print(times.index.tz)
print(times)

To translate from naive to a specific timezone, we use the `tz_localize()` method. Common practice is to define the 'local' timezone using the standard name.

In [None]:
times_est = times.tz_localize('US/Eastern')
print(times_est.index.tz)  # the localized copy now carries a time zone
times_est

Once you set the local timezone for a time series, you can convert it to align to other time zones. In this case, we convert from US/Eastern to US/Hawaii.

In [None]:
times_hi = times_est.tz_convert('US/Hawaii')
times_hi

# February is during that window when the
#     HI <-> EST offset is five hours.

Combining two time series that are in different timezones will result in an output normalized to UTC.
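
Here is the whole time zone round trip as one minimal sketch on made-up data: localize a naive series, convert it, then combine the two versions and check which zone the result lands in.

```python
import pandas as pd

# Made-up, time zone naive data.
idx = pd.date_range('2016-02-06 09:00', periods=3, freq='h')
naive = pd.Series([1, 2, 3], index=idx)

eastern = naive.tz_localize('US/Eastern')   # attach a zone; clock times stay the same
hawaii = eastern.tz_convert('US/Hawaii')    # same instants, different clock times

# Arithmetic lines the two up on the underlying instants,
# and the result is indexed in UTC.
combined = eastern + hawaii

print(naive.index.tz, eastern.index.tz, hawaii.index.tz, combined.index.tz)
print(combined)
```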

Thus far, we have dealt predominantly with potentially irregularly timestamped data. One of the other main classes of timing data is periodic time spans such as years, months, minutes.

In [None]:
period = pd.Period(2015, freq='W-MON')
period

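This period represents a single time span across one week at the end of 2014/beginning of 2015. With `W-MON`, each week is anchored to end on a Monday, so the span runs from Tuesday, 30 Dec 2014 through Monday, 5 Jan 2015.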

To see a sequence or range of periods, you can use the `period_range()` function

In [None]:
per_range = pd.period_range('12/12/2010', '12/12/2015', freq='Q')
per_range
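
A `Period` can tell you where its span starts and stops, which is handy for checking how an anchored frequency like `W-MON` lines up. In this sketch I build the period from an explicit date string rather than the bare year used above; that choice, and the variable names, are mine.

```python
import pandas as pd

week = pd.Period('2015-01-01', freq='W-MON')

# A Period knows where its span begins and ends.
print(week.start_time.day_name(), week.start_time.date())
print(week.end_time.day_name(), week.end_time.date())

# Adding an integer shifts by whole periods.
next_week = week + 1
print(next_week)

# period_range() builds a regular sequence of spans.
quarters = pd.period_range('2015-01', '2015-12', freq='Q')
print(quarters)
```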

To convert a Period to a different frequency, you can use the `.asfreq()` method:

In [None]:
period.asfreq('W', how='start')

# the default anchor for 'W' is Sun (weeks end on Sunday), but this is configurable.


We can similarly convert whole sequences of periods:

In [None]:
per_range.asfreq('D', how='start')

In [None]:
df_period = df_1000.to_period('M')
df_period

To convert back to timestamps, use `to_timestamp()`.

The day of the month data gets lost in the conversion to monthly periods, so with `how='start'` pandas resorts to the first day of the month.

In [None]:
df_ts = df_period.to_timestamp(how='start')
df_ts
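
The same round trip on a tiny made-up `DataFrame` makes the loss of the day information easy to spot:

```python
import pandas as pd

idx = pd.to_datetime(['2016-02-06', '2016-02-20', '2016-03-03'])
readings = pd.DataFrame({'lat': [41.5, 38.9, 40.1]}, index=idx)

monthly = readings.to_period('M')              # timestamps become monthly periods
restored = monthly.to_timestamp(how='start')   # periods become first-of-month timestamps

print(monthly.index)
print(restored.index)
```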

Once resampled, we choose how to summarize the data in each bin. Older versions of pandas used the mean by default, but current versions expect you to pick a method explicitly.

Use the resample method, then call the method you want on the result.

In [None]:
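# NOTE: as far as I know, recent pandas releases warn about both the 'Q'
# alias (in favor of 'QE') and the `kind` keyword for resample.
df_1000[['lat', 'long']].resample('Q', kind='period').mean()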

Methods you can use:
* mean
* median
* std
* first
* last
* min
* max
* var
* count
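
A short sketch of that pattern on made-up data: resample once, then apply whichever methods you like. It also uses `agg()`, which accepts a list of method names and is one convenient way to compare several summaries at once.

```python
import pandas as pd

idx = pd.date_range('2016-01-01', periods=4, freq='D')
coords = pd.DataFrame({'lat': [10.0, 20.0, 30.0, 50.0]}, index=idx)

bins = coords['lat'].resample('2D')   # nothing is computed yet

two_day_mean = bins.mean()
two_day_max = bins.max()
print(two_day_mean)
print(two_day_max)

# agg() returns one column per requested method.
summary = bins.agg(['mean', 'max', 'count'])
print(summary)
```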

As a wrap-up, let's take a look at plotting some of this data. (For more examples of plotting, one of our earlier demo scripts has a variety of sample plots. Also, matplotlib has an extensive library of sample plots).

We'll start with setting up the interactive environment so we can manipulate the graph.

In [None]:
import matplotlib.pyplot as plt


In [None]:
# From there, we create a reduced dataset and create a subplot that is two rows
# high, one column wide and we activate the first subplot. Then we set up the
# graph for that subplot to display the latitude in black circles.

In [None]:
df_1000 = df_1000[['lat', 'long']][:750]
plt.subplot(2, 1, 1)
df_1000['lat'].plot(style='ko')


# Next we activate the second subplot and set up the graph for that subplot to
# display the longitude in blue triangles.
plt.subplot(2, 1, 2)
df_1000['long'].plot(style='b^')

In [None]:
plt.show()