# timeseries

**pandas' time series functionality is a fulfilling way to work with time-based data.**

This Cheatbook (Cheatsheet + Notebook) introduces you to the core of pandas' time series / date functionality.


## References
* [User Guide: Time series / date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)

## Timestamp

Using just pandas' time data types is fun. Pandas provides intuitive ways for working with time data.

### Single time objects
Let's create some Timestamps, i.e. points in time.

In [None]:
import pandas as pd
pd.Timestamp("today")

You can pass in common date formats as strings. pandas will convert them accordingly.

In [None]:
new_years_dinner = pd.Timestamp("2020-01-01 19:00")
new_years_dinner
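
As a small sketch of what "standard date formats" can look like: the spellings below are illustrative picks for this example, not an exhaustive list, and they should all describe the same point in time.

```python
import pandas as pd

# Three illustrative ways to describe the same moment (values chosen for this sketch)
iso_style = pd.Timestamp("2020-01-01 19:00")
iso_t_style = pd.Timestamp("2020-01-01T19:00:00")
from_parts = pd.Timestamp(year=2020, month=1, day=1, hour=19)

# All three compare equal
print(iso_style == iso_t_style == from_parts)
```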

We can also create relative time information, i.e. durations.

In [None]:
time_needed_to_sober_up = pd.Timedelta("1 day")
time_needed_to_sober_up

We can also do calculations with those objects.

In [None]:
completely_sober = new_years_dinner + time_needed_to_sober_up
completely_sober
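
The calculation also works in the other direction. As a minimal sketch (the `back_at_work` value is made up for illustration), subtracting two Timestamps gives a Timedelta:

```python
import pandas as pd

dinner = pd.Timestamp("2020-01-01 19:00")
back_at_work = pd.Timestamp("2020-01-03 08:30")  # hypothetical value for this sketch

# Timestamp - Timestamp results in a Timedelta
recovery_time = back_at_work - dinner

print(recovery_time)                          # the full duration
print(recovery_time.days)                     # whole days only -> 1
print(recovery_time.total_seconds() / 3600)   # duration in hours -> 37.5
```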

### Time series
We can work with a list of time-based data, too. Here we use pandas' `date_range` function to create such a list (with `ME` for month end; pandas versions before 2.2 use `M` instead).

In [None]:
dates = pd.DataFrame(
            pd.date_range("2020-03-01", periods=5, freq="ME"),
            columns=["day"]
        )
dates

With this, we can calculate with time in a similar way as above.

In [None]:
dates["day_after_tomorrow"] = dates['day'] + pd.Timedelta("2 days")
dates
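
`date_range` is not limited to a fixed number of periods. As a rough sketch with made-up dates, we can also pass a start and an end together with a daily frequency (`D`) and shift the whole range at once:

```python
import pandas as pd

# Daily frequency between two illustrative dates; both ends are included
days = pd.date_range(start="2020-03-01", end="2020-03-07", freq="D")

# Adding a Timedelta shifts every entry of the range
shifted = days + pd.Timedelta("2 days")

print(len(days))                  # 7 days in total
print(shifted[0], shifted[-1])    # first and last shifted date
```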

## DatetimeProperties object

The `DatetimeProperties` object, which we get via the `.dt` accessor of a datetime Series, exposes time-related data as attributes and methods that we can use.

In [None]:
dt_properties = dates['day'].dt
dt_properties

Let's take a look at some of the properties.

In [None]:
# this code is just for demonstration purposes and not needed in an analysis
[x for x in dir(dt_properties) if not x.startswith("_")]

For example, we can call the method `day_name()` on a datetime Series to get the name of the weekday for each date.

In [None]:
dt_properties.day_name()
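
A few more of these attributes side by side, as a small sketch on a hand-made Series (the dates are picked for illustration only):

```python
import pandas as pd

days = pd.Series(pd.to_datetime(["2020-03-31", "2020-04-30", "2020-05-31"]))

# Collect several .dt attributes and one method in a single overview table
overview = pd.DataFrame({
    "day": days,
    "year": days.dt.year,
    "month": days.dt.month,
    "weekday": days.dt.dayofweek,        # Monday=0 ... Sunday=6
    "day_name": days.dt.day_name(),
    "is_month_end": days.dt.is_month_end,
})
print(overview)
```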

## Timestamp Series
Let's work with some real data (or at least a part of it). 

### Example Scenario
The following dataset is an excerpt from the change log of a software project. We want to take a look at which hour of the day the changes are made to the software.

#### First try

We can read in time-based datasets like any other dataset.

In [None]:
change_log = pd.read_csv("../datasets/change_history.csv")
change_log.head()

Note that if we import a dataset like this, the time data is read as plain text and shows up with the generic `object` data type.

In [None]:
change_log.info()

So we have to convert that data first into a time-based data type with pandas' `to_datetime()` function.

In [None]:
change_log['timestamp'] = pd.to_datetime(change_log['timestamp'])
change_log.info()
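
Real-world files sometimes contain entries that are not valid dates. The following is a minimal sketch under assumptions: the inline CSV and its column names are made up, and `errors="coerce"` is just one possible way to handle bad entries (it turns them into `NaT` instead of raising an error).

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file; column names are assumptions for this sketch
csv_text = """timestamp,file
2020-01-06 09:15:00,a.py
2020-01-06 14:40:00,b.py
not a date,c.py
"""
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["timestamp"].dtype)    # still text, not a datetime type

# errors="coerce" replaces unparseable entries with NaT
parsed = pd.to_datetime(raw["timestamp"], errors="coerce")
print(parsed.dtype)              # a datetime64 type
print(parsed.isna().sum())       # number of entries that could not be parsed
```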

Next, we want to see at which hour of the day most changes were made. As in the previous examples, we can use the `.dt` accessor to get more detailed information.

In [None]:
change_log['hour'] = change_log['timestamp'].dt.hour
change_log.head()

Let's simply count the number of changes per hour.

In [None]:
# sort by hour so that the bars appear in chronological order
changes_per_hour = change_log['hour'].value_counts().sort_index()
changes_per_hour.head()

And create a little bar chart.

In [None]:
changes_per_hour.plot.bar();

At first glance, this looks pretty fine. But there is a problem: missing data. E.g., at 3am and 5am, there weren't any changes, so these hours don't show up in the chart at all.

We can handle this by using the more advanced `resample` functionality of pandas. This allows us to determine at which frequency we summarize time-based data.

#### Second try: resampling time
For this, we create a time series DataFrame from the dataset again. This time, we import the dataset with the additional `parse_dates` keyword and the number of the column that contains the dates. This gives us a converted date column right from the beginning. With `index_col`, we also use this column as the index.

In [None]:
change_log = pd.read_csv("../datasets/change_history.csv", parse_dates=[0], index_col=0)
change_log.head()
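
To try out `parse_dates` and `index_col` without the dataset at hand, here is a small sketch with a made-up inline CSV (the column names are assumptions). Both parameters also accept column names instead of positions:

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file
csv_text = """timestamp,file
2020-01-06 09:15:00,a.py
2020-01-06 14:40:00,b.py
"""
log = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["timestamp"],   # convert this column while reading
    index_col="timestamp",       # and use it as the index
)

print(type(log.index))   # a DatetimeIndex
print(log.index.hour)    # time attributes are available directly on the index
```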

In [None]:
change_log['changes'] = 1
change_log.head()

Now we are able to apply the `resample` method on it with the information that we want to group our data hourly. We also have to decide what we want to do with the entries that fall into each hour. Here, we simply count them.

In [None]:
hourly_changes = change_log.resample("h").count()
hourly_changes.head()
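
The difference to plain counting is easiest to see on a tiny example. This sketch uses three made-up timestamps with a gap between 10:00 and 12:00:

```python
import pandas as pd

# Made-up change timestamps: nothing happens between 10:00 and 12:00
toy_log = pd.DataFrame(
    {"changes": 1},
    index=pd.to_datetime([
        "2020-01-06 09:05", "2020-01-06 09:50", "2020-01-06 12:20",
    ]),
)

# Plain counting only knows the hours that actually occur in the data
per_hour_naive = toy_log.index.hour.value_counts().sort_index()

# Resampling creates one bin per hour, including the empty ones
per_hour_resampled = toy_log.resample("h").count()

print(per_hour_naive)       # only hours 9 and 12 appear
print(per_hour_resampled)   # 09:00 to 12:00, the gap shows up as 0
```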

In [None]:
hourly_changes['hour'] = hourly_changes.index.hour
hourly_changes.head()

In [None]:
changes_per_hour = hourly_changes.groupby("hour").sum()
changes_per_hour.head()

In [None]:
changes_per_hour.plot.bar();

## Display progressions
A cumulative sum over the hourly counts shows how the total number of changes grows over time.

In [None]:
hourly_changes.head()

In [None]:
accumulated_changes = hourly_changes[['changes']].cumsum()
accumulated_changes.head()

In [None]:
accumulated_changes.plot();

## Grouping time and data
So far, we only grouped by time-based data. But what if we want to group, e.g., the weekly changes by developer? Let's do this!

Once again, we read in the dataset that we already know. We only let pandas parse the timestamp information.

In [None]:
change_log = pd.read_csv("../datasets/change_history.csv", parse_dates=[0])
change_log.head() 

For this scenario, we also need some developers.

In [None]:
devs = pd.Series(["Alice", "Bob", "John", "Steve", "Yvonne"])
devs

Let's randomly assign these artificial developers to the changes and also mark each change in a separate column.

In [None]:
change_log['dev'] = devs.sample(len(change_log), replace=True).values
change_log['changes'] = 1
change_log.head()

OK, we want to group the changes per week per developer to find out the most active developer of the week (if this makes sense is up to you to find out ;-).

For this, we use `groupby` with a pandas `Grouper`. With the `Grouper`, we can say which column we want to group at which frequency (seconds, minutes, ..., years and so on). In our case: weekly. Additionally, we want to track how many changes each developer made per week. So we add the developer column to the list of grouping keys and sum up the changes accordingly.

In [None]:
weekly_changes_per_dev = \
    change_log.groupby([
        pd.Grouper(key='timestamp', freq='W'),
        'dev']) \
    .sum()
weekly_changes_per_dev.head()

This gives us a DataFrame that lists the number of changes per week for each developer. We sort this list to get a kind of "most active developer per week" list:

In [None]:
weekly_changes_per_dev.sort_values(
    by=['timestamp', 'changes'],
    ascending=[True, False])
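
If we only want the single top entry per week, one possible approach (a sketch on made-up data, not the only way to do it) is to keep just the first row per week after sorting:

```python
import pandas as pd

# Made-up change log; names and dates are illustrative only
toy_log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-06", "2020-01-07", "2020-01-08",   # first week
        "2020-01-13", "2020-01-14",                 # second week
    ]),
    "dev": ["Alice", "Bob", "Alice", "Bob", "Bob"],
    "changes": 1,
})

weekly = (
    toy_log
    .groupby([pd.Grouper(key="timestamp", freq="W"), "dev"])["changes"]
    .sum()
    .reset_index()
)

# Sort by week and descending change count, then keep the first row per week.
# Note: in case of a tie, only the row that comes first after sorting is kept.
top_per_week = (
    weekly
    .sort_values(by=["timestamp", "changes"], ascending=[True, False])
    .drop_duplicates("timestamp")
)
print(top_per_week)
```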

## Summary

This Cheatbook guided you through several time series use cases. I hope you find this a good starting point for your own data analysis with time-based data!