# Software Analytics Mini Tutorial Part III: Working with Time Series

## Introduction
This series of notebooks is a simple mini tutorial that introduces you to the basic functionality of Jupyter, Python, pandas and matplotlib. The comprehensive explanations should enable you to analyze software data on your own. The examples are therefore chosen in such a way that we come across the typical methods of a data analysis. Have fun!

*This is part III: The basics of time series analysis with pandas.*

## The analysis goal
In this mini tutorial, we want to find out for the Linux kernel software project:
* Question 1: At what hour of the day are commits made?
* Question 2: On which weekdays do commits occur?
* Question 3: How did the development progress from day to day?

## The dataset

#### Import dataset
1. import pandas
1. read in `../datasets/git_log_linux_authors_timestamps.gz` into `log`.
1. display the first five entries
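
The import steps can be sketched as follows. Because the real dataset is not available here, the sketch first writes a tiny stand-in file; the file name `toy_git_log.gz`, its rows and the author names are made up for illustration, and the assumption is that the real file is a gzip-compressed CSV with a header row.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real dataset: a tiny gzip-compressed CSV
# with a `timestamp` and an `author` column (rows are made up).
csv_text = (
    "timestamp,author\n"
    "2017-12-31 14:47:43,Alice\n"
    "2017-12-31 13:13:56,Bob\n"
)
path = os.path.join(tempfile.mkdtemp(), "toy_git_log.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(csv_text)

# pandas infers the gzip compression from the ".gz" file extension
log = pd.read_csv(path)
print(log.head())
```

With the real data, only the `pd.read_csv(...)` call with the path from step 2 is needed.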

## Working with time-based data
Now let's look at the timestamp information.

### Question 1: Daytime of Commits

First, we want to find out at what time of day the developers commit.

#### View timestamp column
1. display the first five entries of the series `timestamp`.

#### Convert to the Timestamp data type

The data type of the `timestamp` series is still `object` (because it's plain text from the CSV file). To work with the time capabilities of pandas, we need to convert this data to pandas' `Timestamp` data type.


1. use the pandas function `pd.to_datetime` to convert the column `timestamp` into a real date data type.
1. write the result into the new variable `ts` (abbreviation for "timestamp")*.
1. output the first five entries.
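
A minimal sketch of the conversion, using a small made-up `log` DataFrame in place of the real dataset (the rows are hypothetical):

```python
import pandas as pd

# Toy data standing in for the real git log (hypothetical rows)
log = pd.DataFrame({
    "timestamp": [
        "2017-12-31 14:47:43",
        "2017-12-31 13:13:56",
        "2017-12-30 09:05:00",
    ],
    "author": ["Alice", "Bob", "Alice"],
})

# Parse the text column into real datetime values
ts = pd.to_datetime(log["timestamp"])
print(ts.head())
```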

*\*Note: We could also have overwritten the `timestamp` series. But to show some pandas functionality, we use a separate Series.*

##### Discussion
* Was the conversion successful? 
* What could possibly happen during the conversion?
* How could you treat issues that occur?
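
One possible way to treat unparseable entries is the `errors="coerce"` option of `pd.to_datetime`, which turns them into `NaT` (missing values) instead of raising an error. The sketch below uses made-up input with one deliberately broken entry; whether the real dataset contains such entries is not known here.

```python
import pandas as pd

# Made-up input with one value that is not a valid timestamp
raw = pd.Series([
    "2017-12-31 14:47:43",
    "not a date",
    "2017-12-30 09:05:00",
])

# Invalid entries become NaT instead of aborting the conversion
ts = pd.to_datetime(raw, errors="coerce")

# Inspect which original values could not be parsed
unparsed = raw[ts.isna()]
print(unparsed)
```

Rows with `NaT` can then be inspected, corrected or dropped with `dropna()`, depending on the analysis.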

#### Work with hourly data
1. access the datetime accessor `dt` of the Series `ts`.
2. inspect the `hour` property of the `dt` object.
3. store the hours into a new Series `hour`
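
As a small sketch with made-up timestamps, the `dt` accessor exposes the datetime components of every entry at once:

```python
import pandas as pd

# Toy timestamps (hypothetical values)
ts = pd.to_datetime(pd.Series([
    "2017-12-31 14:47:43",
    "2017-12-31 13:13:56",
    "2017-12-30 09:05:00",
]))

# `dt.hour` returns the hour of day (0-23) for each entry
hour = ts.dt.hour
print(hour)
```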

#### Add a new Series to a DataFrame

1. Add the `hour` Series to the existing `log` DataFrame with the syntax

```python
<DataFrame>['<new Series name>'] = <Series>
```
2. print out the first five entries of `log`.
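
Applied to a made-up `log` DataFrame (hypothetical rows), the assignment could look like this:

```python
import pandas as pd

# Toy data standing in for the real git log (hypothetical rows)
log = pd.DataFrame({
    "timestamp": [
        "2017-12-31 14:47:43",
        "2017-12-31 13:13:56",
        "2017-12-30 09:05:00",
    ],
    "author": ["Alice", "Bob", "Alice"],
})
hour = pd.to_datetime(log["timestamp"]).dt.hour

# Assigning to a new column name appends the Series as a column;
# rows are matched via the shared index
log["hour"] = hour
print(log.head())
```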

#### Find out the favorite commit times
1. sum up the number of commits (=rows) for each hour
1. sort the index with `sort_index()` to list the hours in an ascending order
1. save the result in `commits_per_hour`
1. display the first five entries
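
One way to do the counting is `value_counts()`, which counts how often each value occurs; this is an assumption about the intended approach, and `groupby("hour").size()` would give the same numbers. The hours below are made up.

```python
import pandas as pd

# Toy `hour` column (hypothetical values)
log = pd.DataFrame({"hour": [14, 13, 9, 13]})

# Count rows per hour, then order the result by hour instead of by frequency
commits_per_hour = log["hour"].value_counts().sort_index()
print(commits_per_hour.head())
```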

#### Visualize the hourly commit result
1. plot a bar chart of the hourly commit counts.

#### Enhance the visualization

We store the return object of the `bar()` function in the variable `ax`. This is an `Axes` object of the underlying plotting library `matplotlib`, through which we can customize additional properties of the plot.

1. add the title "Commits per Hour" via `set_title("<title name>")`
1. add the label "Hour of Day" to the X-axis with `set_xlabel("<X-axis label name>")`
1. add the label "Number of Commits" to the Y-axis with `set_ylabel("<Y-axis label name>")`
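
The plotting and labeling steps can be sketched like this, with made-up counts. The `matplotlib.use("Agg")` line is only there so that the sketch also runs outside a notebook without a display; inside Jupyter it is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import pandas as pd

# Toy hourly commit counts (hypothetical values), indexed by hour of day
commits_per_hour = pd.Series([3, 1, 7], index=[9, 13, 14])

# `plot.bar()` returns the matplotlib Axes object of the chart
ax = commits_per_hour.plot.bar()
ax.set_title("Commits per Hour")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Number of Commits")
```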

##### Discussion

* What do you find interesting?
* What could be differences compared to the commit distribution for software systems within a company?
* What is wrong with this approach of counting all the working hours per day?

#### Reworking the time-based analysis
1. To address some of the issues above, we first take a subset of the commits with `sample(50)` on the `log` DataFrame and put it into a `sample` variable.

#### Working with time-based data
1. Convert the `timestamp` column to a real `datetime` data type.
1. Make the `timestamp` column the new index with the `set_index("<column_name>")` method.
1. Store the newly indexed DataFrame in `timed_sample`.
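
A sketch of sampling, converting and re-indexing on made-up data. The toy DataFrame has only five rows, so the sketch draws 3 rows instead of 50, and `random_state=0` is added only to make the draw repeatable.

```python
import pandas as pd

# Toy data standing in for the real git log (hypothetical rows)
log = pd.DataFrame({
    "timestamp": [
        "2017-12-29 10:00:00",
        "2017-12-29 16:30:00",
        "2017-12-30 09:05:00",
        "2017-12-31 13:13:56",
        "2017-12-31 14:47:43",
    ],
    "author": ["Alice", "Bob", "Alice", "Carol", "Bob"],
})

# Draw a random subset (3 rows here because the toy data is tiny)
sample = log.sample(3, random_state=0)

# Convert the text column and use it as the index
sample["timestamp"] = pd.to_datetime(sample["timestamp"])
timed_sample = sample.set_index("timestamp")
print(timed_sample)
```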

#### Resampling time series
1. create a new DataFrame `hourly_commits` that counts the commits in each hour using `resample()`.
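
A minimal sketch of hourly resampling on a made-up, time-indexed DataFrame. The alias `"h"` stands for hours; older pandas versions also accept the spelling `"H"`.

```python
import pandas as pd

# Toy time-indexed commits (hypothetical rows)
timed_sample = pd.DataFrame(
    {"author": ["Alice", "Bob", "Alice", "Carol"]},
    index=pd.to_datetime([
        "2017-12-30 09:05:00",
        "2017-12-30 09:40:00",
        "2017-12-30 11:15:00",
        "2017-12-31 09:30:00",
    ]),
)

# One row per hour between the first and the last commit;
# hours without commits get a count of 0
hourly_commits = timed_sample.resample("h").count()
print(hourly_commits.head())
```

Unlike the first approach, the result also contains the empty hours, because `resample()` builds a continuous hourly grid.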

#### Get hour of day
1. Take the `hour` property of the index object and put it into the new column `hour_of_day`.

#### Sum up commits per hour of day
1. use `groupby("<column_name>")` with `sum()` to sum all commits that occurred at the same hour of the day
1. store the result in a `commits_per_hour_of_day` DataFrame
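
The last two sections (extracting the hour of day and summing per hour) can be sketched together, again on made-up data:

```python
import pandas as pd

# Toy time-indexed commits (hypothetical rows), resampled to hourly counts
timed_sample = pd.DataFrame(
    {"author": ["Alice", "Bob", "Alice", "Carol"]},
    index=pd.to_datetime([
        "2017-12-30 09:05:00",
        "2017-12-30 09:40:00",
        "2017-12-30 11:15:00",
        "2017-12-31 09:30:00",
    ]),
)
hourly_commits = timed_sample.resample("h").count()

# A DatetimeIndex offers the datetime properties directly (no `dt` needed)
hourly_commits["hour_of_day"] = hourly_commits.index.hour

# Add up the hourly counts of all days that share the same hour of day
commits_per_hour_of_day = hourly_commits.groupby("hour_of_day").sum()
print(commits_per_hour_of_day.head())
```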

#### Visualize the results
1. plot a bar chart for the result (you just need one column for this)

##### Discussion

* What's the difference?

### Question 2: Commits per weekday

We can also analyze the commits per weekday.

#### Retrieve the weekday number

1. store the weekday number into an additional Series `weekday` in `log` by using the `dayofweek` property of `dt` from the `ts` Series
1. store the weekday name into an additional Series `weekday_name` in `log` by using the `day_name()` method of `dt` from the `ts` Series
1. display the first five entries

*Note: The `ts` Series and the `log` DataFrame fit nicely together because they have the same so-called* `index` *(the mechanism that takes care of the order in a DataFrame or Series).*
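
A short sketch of both weekday columns, using made-up timestamps. In pandas, `dayofweek` counts from Monday (0) to Sunday (6).

```python
import pandas as pd

# Toy data standing in for the real git log (hypothetical rows)
log = pd.DataFrame({
    "timestamp": [
        "2017-12-31 14:47:43",
        "2017-12-30 09:05:00",
        "2017-12-25 18:00:00",
    ],
})
ts = pd.to_datetime(log["timestamp"])

# Weekday as number (Monday=0 ... Sunday=6) and as English name
log["weekday"] = ts.dt.dayofweek
log["weekday_name"] = ts.dt.day_name()
print(log.head())
```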

#### Group data

Now it's getting tricky. We need to group the values by the day names in `weekday_name`, but we also have to keep the order of the days given by `weekday`.

1. use `groupby` on the `log` DataFrame to group by the list of Series `['weekday', 'weekday_name']`.
1. select just the `timestamp` Series as remaining data
1. aggregate the timestamp values with the `count()` method
1. display all results
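
The grouping steps might look like the following sketch. The rows are made up, and the variable name `grouped` is only chosen for illustration.

```python
import pandas as pd

# Toy data standing in for the real git log (hypothetical rows)
log = pd.DataFrame({
    "timestamp": [
        "2017-12-31 14:47:43",
        "2017-12-30 09:05:00",
        "2017-12-25 18:00:00",
        "2018-01-01 08:30:00",
    ],
})
ts = pd.to_datetime(log["timestamp"])
log["weekday"] = ts.dt.dayofweek
log["weekday_name"] = ts.dt.day_name()

# Grouping by both columns sorts by weekday number first,
# so the day names end up in calendar order
grouped = log.groupby(["weekday", "weekday_name"])["timestamp"].count()
print(grouped)
```

The result is a Series with a two-level index made of `weekday` and `weekday_name`.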

#### Partially reset the index

Now we need to just keep the ordered `weekday_name` as index.

1. use the `reset_index` method with `level=0` as parameter to move the first index level back to a normal Series
1. store the newly indexed DataFrame in `commits_per_day_name`
1. display all results
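
A sketch of the partial reset on made-up data; `grouped` is a hypothetical name for the result of the previous grouping step.

```python
import pandas as pd

# Toy grouped result (hypothetical rows), built as in the grouping step
log = pd.DataFrame({
    "timestamp": ["2017-12-31 14:47:43", "2017-12-25 18:00:00", "2018-01-01 08:30:00"],
})
ts = pd.to_datetime(log["timestamp"])
log["weekday"] = ts.dt.dayofweek
log["weekday_name"] = ts.dt.day_name()
grouped = log.groupby(["weekday", "weekday_name"])["timestamp"].count()

# Move only the first index level (`weekday`) back into a column;
# `weekday_name` stays as the index, still in calendar order
commits_per_day_name = grouped.reset_index(level=0)
print(commits_per_day_name)
```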

#### Plot the commits per weekday

1. plot the Series `timestamp` of `commits_per_day_name` as bar chart.

### Question 3: Overall daily progress

In the following, we want to see the progress of the number of commits over the last years by using a `DatetimeIndex` based DataFrame. We want to aggregate the number of commits for each day.

#### Create a time-based DataFrame

1. set the `ts` Series from above as index using `set_index(<Series>)`
1. just select the `author` column (we just need one column for the counting later on)
1. store the result in `log_timed`
1. display the first five entries
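
These steps can be sketched as follows on made-up data. Selecting with double brackets (`[["author"]]`) is an assumption made here so that the result stays a DataFrame; single brackets would return a Series instead.

```python
import pandas as pd

# Toy data standing in for the real git log (hypothetical rows)
log = pd.DataFrame({
    "timestamp": [
        "2017-12-29 10:00:00",
        "2017-12-29 16:30:00",
        "2017-12-31 09:00:00",
    ],
    "author": ["Alice", "Bob", "Alice"],
})
ts = pd.to_datetime(log["timestamp"])

# `set_index` also accepts a Series of the same length as the new index
log_timed = log.set_index(ts)[["author"]]
print(log_timed.head())
```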

#### Resample the dataset

1. use the `resample("<time unit>")` function of `log_timed` to group values on a daily basis using `"D"`*
1. count the values per day using the `count()` method
1. store the result in `commits_per_day`
1. display the first five entries

*\*Other time units are for example months (`M`), quarters (`Q`) or years (`A`). Newer pandas versions (2.2 and later) spell these aliases `ME`, `QE` and `YE`.*
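
A minimal sketch of daily resampling, with a made-up `log_timed` DataFrame:

```python
import pandas as pd

# Toy time-indexed commits (hypothetical rows)
log_timed = pd.DataFrame(
    {"author": ["Alice", "Bob", "Alice"]},
    index=pd.to_datetime([
        "2017-12-29 10:00:00",
        "2017-12-29 16:30:00",
        "2017-12-31 09:00:00",
    ]),
)

# One row per calendar day; days without commits get a count of 0
commits_per_day = log_timed.resample("D").count()
print(commits_per_day.head())
```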

#### Calculate the progress

To show the commit history over time, we calculate the cumulative sum of all daily entries. This sums up all values one after another.

1. use `cumsum()` on `commits_per_day`
1. store the result in `commits_per_day_cumulative`
1. display the first five results
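
The cumulative sum can be sketched on made-up daily counts like this:

```python
import pandas as pd

# Toy daily commit counts (hypothetical values)
commits_per_day = pd.DataFrame(
    {"author": [2, 0, 1]},
    index=pd.date_range("2017-12-29", periods=3, freq="D"),
)

# Each row holds the total of all commits up to and including that day
commits_per_day_cumulative = commits_per_day.cumsum()
print(commits_per_day_cumulative.head())
```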



#### Plot the results

1. plot the results of the accumulated values as a line chart.

# Summary

You've learned some basics about pandas and its usage with software data. This will hopefully get you started in your daily work. Other important topics that are still missing are:

* Reading in complicated or semi-structured data structures
* Merging different data sources with `pd.merge` and `join`
* Transforming data with `pivot_table`

But with the features shown, you are well on your way to becoming a Software Development Analyst who can leverage Data Science to fix problems in software systems!

If you want to dive deeper into this topic, take a look at my [other blog posts on that topic](http://www.feststelltaste.de/category/software-analytics/) or my microsite [softwareanalytics.de](https://softwareanalytics.de/). I'm also [offering training for companies](http://markusharrer.de/) who want to fix their problems using data analysis in software development.

I'm looking forward to your comments and feedback on [GitHub](https://github.com/feststelltaste/software-analytics-workshop/issues) or on [Twitter](https://www.twitter.com/feststelltaste)!