# Introduction
This notebook is a simple mini-tutorial to introduce you to basic functions of Jupyter, Python, pandas and matplotlib with the aim of analyzing software data. Therefore, the example is chosen in such a way that we come across the typical methods in a data analysis. Have fun!

*This is part II: the basics of the data analysis framework pandas. For Jupyter and Python basics, go to [00_jupyter_python_basics.ipynb](00_jupyter_python_basics.ipynb).*


# Data Analysis with pandas
OK, let's start learning pandas with a little analysis!

In this notebook, we want to take a closer look at the development history of the open source project "Linux" based on the history of the corresponding GitHub mirror repository.

A local clone of the GitHub repository https://github.com/torvalds/linux/ was created by using the command  

```
git clone https://github.com/torvalds/linux.git
```

The relevant parts of the history for this analysis were produced by using

```
git log --pretty="%ad,%aN" --no-merges > git_log_linux_authors_timestamps.csv
```

This command returned the author timestamp (`%ad`, the author date) and the author name (`%aN`) for each commit of the Git repository. The corresponding values are separated by commas. We also indicated that we do not want to receive merge commits (via `--no-merges`). The output was saved in the file `git_log_linux_authors_timestamps.csv` and then compressed with `gzip` to the file `git_log_linux_authors_timestamps.gz` to reduce the file size.

_Note: For an optimized demo, the headers and the separator have been changed manually in the provided dataset to get through this analysis more easily. The differences can be seen at https://www.feststelltaste.de/developers-habits-linux-edition/, which was done with the original dataset._

# Getting to know pandas
Pandas is a data analysis tool written in Python (and C), which is ideally suited for the evaluation of tabular data due to the use of effective data structures and built-in statistics functions.

## Basics

We import the data from above with the help of pandas. We import `pandas` with the common abbreviation `pd` using the `import ... as ...` syntax of Python.

We can check whether the import of the module really worked by checking the documentation of the `pd` module. To do this, we append the `?` operator to the `pd` variable and execute the cell. The documentation of the module appears in the lower part of the browser window. We can read through this area and make it disappear again with the `ESC` key.

We read the compressed CSV file `git_log_linux_authors_timestamps.gz` in the directory `../datasets` with the `read_csv()` method into a DataFrame.

The result of our execution is stored in the variable `git_log`. We've just loaded data into a so-called **DataFrame** (something like a programmable Excel worksheet), which in our case consists of two **Series** (= columns). 
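The import and loading steps above can be sketched as follows. This is a minimal, self-contained sketch: because the real dataset is not bundled here, it first writes a tiny gzip-compressed CSV file with made-up rows (the authors `Alice`, `Bob` and `Carol` and their timestamps are hypothetical) and then reads it back. The column names `timestamp` and `author` are taken from the text; the comma separator is an assumption.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample rows standing in for the real Linux Git log
sample_csv = (
    "timestamp,author\n"
    "2017-12-31 14:47:43,Alice\n"
    "2017-12-31 13:13:56,Bob\n"
    "2017-12-30 21:05:12,Carol\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small gzip-compressed CSV file, named like the real dataset
    path = Path(tmp_dir) / "git_log_linux_authors_timestamps.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(sample_csv)

    # read_csv infers the gzip compression from the ".gz" file extension
    git_log = pd.read_csv(path)

print(type(git_log))
print(git_log.columns.tolist())
```

With the real dataset, the call would presumably just point to the file in `../datasets` instead of the temporary file.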

We can now perform operations on the DataFrame. For example, we can use `head()` to display the first five entries.

Next, we call `info()` on the `DataFrame` to get some basic information about the data we have just read in.
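A short sketch of both calls, using a small hand-made `DataFrame` with hypothetical rows in place of the real `git_log`:

```python
import pandas as pd

# Hypothetical stand-in for the real `git_log` DataFrame
git_log = pd.DataFrame({
    "timestamp": [f"2017-12-{day:02d} 10:00:00" for day in range(1, 8)],
    "author": ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice", "Dave"],
})

# head() returns the first five rows by default
first_rows = git_log.head()
print(first_rows)

# info() prints column names, non-null counts, dtypes and memory usage
git_log.info()
```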

We can access the individual Series / columns by using the `['<column name>']` notation or (in most cases, i.e. as long as the column name does not collide with the name of a method or attribute offered by the `DataFrame` itself) by using the column name directly as an attribute.
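Both access styles side by side, as a minimal sketch on hypothetical sample data:

```python
import pandas as pd

# Hypothetical stand-in for the real `git_log` DataFrame
git_log = pd.DataFrame({
    "timestamp": ["2017-12-31 14:47:43", "2017-12-31 13:13:56"],
    "author": ["Alice", "Bob"],
})

# Bracket notation always works
authors_by_key = git_log["author"]

# Attribute notation works as long as the column name does not collide
# with an existing DataFrame attribute or method (e.g. a column "count")
authors_by_attr = git_log.author

print(type(authors_by_key))
print(authors_by_key.equals(authors_by_attr))
```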

## First Analysis

We can also perform various operations on a `Series`. For example, with `value_counts()`, we can count the values contained in a `Series` and have them sorted by frequency. The result is again a `Series`, but this time with the counted and sorted values. We can additionally call `head(10)` on this `Series`. This gives us a quick way to display the top 10 values of a `Series`. We can then store the result in a variable `top10` and output it by writing the variable name on the next line of the cell.
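A minimal sketch of this counting step; the author names are hypothetical sample data:

```python
import pandas as pd

# Hypothetical stand-in for the real `git_log` DataFrame
git_log = pd.DataFrame({
    "author": ["Alice", "Bob", "Alice", "Carol", "Alice", "Bob"],
})

# Count the commits per author (most frequent first) and keep at most ten
top10 = git_log.author.value_counts().head(10)
print(top10)
```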

## First visualizations
Next, we want to visualize or plot the result. To display the plotting result of the internally used plotting library `matplotlib` directly in the notebook, we have to execute this magic command in our notebook

```
%matplotlib inline
```

before calling the `plot()` method.

By default, when `plot()` is called on a `DataFrame` or `Series`, a line chart is created.

That doesn't make much sense here, so we use a sub-method of `plot` called `bar()` to create a bar chart.
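As a sketch, with a hand-made `top10` Series (hypothetical counts) in place of the real result. In a notebook, the `%matplotlib inline` magic command would be executed beforehand; it is left out here because it is not plain Python.

```python
import pandas as pd

# Hypothetical stand-in for the real `top10` Series
top10 = pd.Series([3, 2, 1], index=["Alice", "Bob", "Carol"], name="commits")

# top10.plot() alone would draw a line chart;
# plot.bar() draws one bar per author instead
ax = top10.plot.bar()
```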

This data can also be visualized as a pie chart. For this, we call the `pie()` method instead of `bar()`. We can also add a semicolon `;` after the `plot` command to suppress the printed text representation of the returned object.

However, the diagram does not look very nice here.

With the optional styling parameters, we can get a nicer graphic. We use
* `figsize=[7,7]` as size
* `title="Top 10 authors"` as title
* `label=""` to avoid displaying the superfluous label on the left.
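A sketch of a styled pie chart on hypothetical data. Note one assumption: this sketch passes `label=""` (an empty name for the plotted `Series`) to blank the axis label on the left, because `labels=None` would instead remove the author names from the wedges.

```python
import pandas as pd

# Hypothetical stand-in for the real `top10` Series
top10 = pd.Series([3, 2, 1], index=["Alice", "Bob", "Carol"], name="author")

# figsize and title as in the text; label="" is an assumption of this sketch
ax = top10.plot.pie(figsize=[7, 7], title="Top 10 authors", label="")
```

Because the result is assigned to `ax`, no text representation is printed, so the trailing semicolon is not needed here.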

## Working with dates
Now let's look at the timestamp information. We want to find out at what time of day the developers commit.

Before we can enter the world of time series processing, we must first convert our column with the dates into the appropriate data type. At the moment our column `timestamp` is still a string, i.e. of textual nature. We can see this by using the helper function `type(<object>)` to display the first entry of the `timestamp` column:

Of course, pandas also helps us to convert data types. The function `pd.to_datetime` takes as first parameter a `Series` with dates and converts them. The return value is a `Series` with values of the data type `Timestamp`. The conversion works automagically for most textual dates, because pandas can handle many different date formats. We also write the result back into the same column.

To check if the conversion was successful, we can check the first value of our converted column `timestamp` by calling `type()` again.
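The type check before and after the conversion could look like this minimal sketch (hypothetical sample rows; the date format of the real dataset may differ):

```python
import pandas as pd

# Hypothetical stand-in for the real `git_log` DataFrame
git_log = pd.DataFrame({
    "timestamp": ["2017-12-31 14:47:43", "2017-12-30 09:05:12"],
    "author": ["Alice", "Bob"],
})

# Before the conversion, each entry is a plain Python string
type_before = type(git_log.timestamp.iloc[0])
print(type_before)

# Convert the text column and write the result back into the same column
git_log["timestamp"] = pd.to_datetime(git_log["timestamp"])

# After the conversion, each entry is a pandas Timestamp
type_after = type(git_log.timestamp.iloc[0])
print(type_after)
```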

We can now also access individual parts of the date values. For this, we use the `dt` ("datetime") object with its properties like `hour`.

Together with the `value_counts()` method that I've already introduced above, we can now count the values by their occurrence again. However, it is important that we also set the parameter `sort=False` to avoid sorting by frequency and to keep the order of the hours.
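A sketch of the hour extraction and counting on hypothetical timestamps. The trailing `sort_index()` is an addition of this sketch, not part of the text above: `sort=False` only switches off sorting by frequency, so sorting by index is a safe way to guarantee that the hours appear in ascending order.

```python
import pandas as pd

# Hypothetical timestamps standing in for the converted `timestamp` column
timestamps = pd.to_datetime(pd.Series([
    "2017-12-31 14:47:43",
    "2017-12-31 09:13:56",
    "2017-12-30 14:05:12",
    "2017-12-29 21:30:00",
]))

# The dt accessor exposes date parts such as the hour
hours = timestamps.dt.hour

# Count commits per hour without sorting by frequency,
# then order the result by hour (assumption of this sketch)
commits_per_hour = hours.value_counts(sort=False).sort_index()
print(commits_per_hour)
```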

We can display the result by means of a bar chart and thus get an overview of how many commits occurred for each hour.

We now additionally label the plot. To do this, we store the return object of the `bar()` function in the variable `ax`. This is an `Axes` object of the underlying plotting library `matplotlib`, through which we can customize additional properties of the plot. We set here

* the title "Commits per Hour" via `set_title("<title>")`
* the label "Hour of Day" of the X-axis with `set_xlabel("<X-axis label>")`
* the label "Number of Commits" of the Y-axis with `set_ylabel("<Y-axis label>")`

The result is a more meaningful, labeled bar chart.
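The labeling steps can be sketched like this, with a hand-made `commits_per_hour` Series (hypothetical counts):

```python
import pandas as pd

# Hypothetical stand-in for the real `commits_per_hour` Series
commits_per_hour = pd.Series([1, 2, 1], index=[9, 14, 21])

# plot.bar() returns a matplotlib Axes object that we can customize
ax = commits_per_hour.plot.bar()
ax.set_title("Commits per Hour")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Number of Commits")
```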

We can also analyze the commits per weekday. To do this, we use the `weekday` attribute of the datetime accessor `dt`. The values here are 0-based with Monday as the first day of the week. As usual, we count the values using `value_counts` and do not sort the values by size but keep the sorting by weekday.

The result in `commits_per_weekday` can be output as a bar chart using `plot.bar()`.
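A sketch of the weekday analysis on hypothetical timestamps. As an assumption of this sketch, `sort_index()` is appended to make sure the weekdays come out in order from Monday (0) to Sunday (6).

```python
import pandas as pd

# Hypothetical timestamps standing in for the converted `timestamp` column
timestamps = pd.to_datetime(pd.Series([
    "2018-01-01 10:00:00",  # a Monday
    "2018-01-02 11:00:00",  # a Tuesday
    "2018-01-02 15:30:00",  # a Tuesday
    "2018-01-07 20:15:00",  # a Sunday
]))

# weekday is 0-based: Monday = 0, ..., Sunday = 6
commits_per_weekday = (
    timestamps.dt.weekday.value_counts(sort=False).sort_index()
)
print(commits_per_weekday)

ax = commits_per_weekday.plot.bar()
```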

## Displaying the commit history
In the following, we want to see the progress of the number of commits over the last years by using a `DatetimeIndex` based DataFrame. To do this, we set the `timestamp` column as index using `set_index('<columnname>')`. Furthermore, we select just the `author` column. Thus we work continuously on a pure `Series` instead of a `DataFrame`. 
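A minimal sketch of this indexing step on hypothetical rows; the variable name `commits` is chosen for illustration only:

```python
import pandas as pd

# Hypothetical stand-in for the `git_log` DataFrame with converted timestamps
git_log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-12-29 21:30:00",
        "2017-12-30 14:05:12",
        "2017-12-31 09:13:56",
    ]),
    "author": ["Alice", "Bob", "Alice"],
})

# Use the timestamps as index and keep only the author column
commits = git_log.set_index("timestamp")["author"]

print(type(commits))
print(type(commits.index))
```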

Side note: The usage of a `Series` is very similar to that of a `DataFrame` with regard to the statistical functions. However, a `Series` is not beautifully formatted in a table, which is why I personally prefer using a `DataFrame`.

Using the `resample("<time unit>")` function of the `DataFrame`, we can now group values according to certain time units such as days (`D`), months (`M`), quarters (`Q`) or years (`A`). We use a `resample("D")` for counting per day. We also specify how the individual values should be combined per time unit. For this, we select the `count()` function to count the number of commits for each day.
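The daily counting can be sketched as follows, using a small hand-made `Series` with a `DatetimeIndex` (hypothetical data). Only the `D` alias is used here; as far as I know, recent pandas releases prefer `ME`, `QE` and `YE` over the older `M`, `Q` and `A` aliases, so those may trigger deprecation warnings depending on the installed version.

```python
import pandas as pd

# Hypothetical stand-in for the author Series indexed by commit time
commits = pd.Series(
    ["Alice", "Bob", "Alice", "Carol"],
    index=pd.to_datetime([
        "2017-12-29 09:00:00",
        "2017-12-29 17:30:00",
        "2017-12-31 08:15:00",
        "2017-12-31 22:45:00",
    ]),
    name="author",
)

# Group the entries per calendar day and count them;
# days without commits show up with a count of 0
commits_per_day = commits.resample("D").count()
print(commits_per_day)
```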

To show the commit history over the years, we calculate the cumulative sum of all daily entries using `cumsum()`. All values get summed up one after the other.

The result is plotted as a line chart, which shows the growth of the number of commits over the years.
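A sketch of the cumulative sum and the resulting line chart, based on hypothetical daily counts:

```python
import pandas as pd

# Hypothetical daily commit counts standing in for the resampled data
commits_per_day = pd.Series(
    [2, 0, 2, 5],
    index=pd.date_range("2017-12-29", periods=4, freq="D"),
)

# Running total of commits up to and including each day
commit_history = commits_per_day.cumsum()
print(commit_history)

# plot() draws a line chart by default
ax = commit_history.plot()
```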

# What you've learned
OK, that's it for now!

You've learned some basics about Pandas and its usage with software data. This will hopefully get you started in your daily work. Other important topics that are still missing are:

* Reading in complicated or semi-structured data structures
* Merging different data sources with `pd.merge` and `join`
* Grouping similar data using `groupby`
* Transforming data with `pivot_table`

But with the features shown, you are well on your way to becoming a Software Development Analyst who can leverage Data Science to fix problems in software systems!

If you want to dive deeper into this topic, take a look at my [other blog posts on that topic](http://www.feststelltaste.de/category/software-analytics/). I'm also [offering training for companies](http://markusharrer.de/) who want to fix their problems using data analysis in software development.

I'm looking forward to your comments and feedback!