# 1. Project info

**Project title**: Explore Linux's evolution with Pandas

**Name:** Markus Harrer

**E-mail:** datacamp@markusharrer.de

**GitHub username**: feststelltaste

**Short description**: Explore the Git repository history of the Linux kernel to find out about the development of the most famous open-source operating system.

#### Long description ####

Version control repositories like CVS, Subversion or Git store rich evolution information about a software project. In this project, you'll be challenged to read in, clean up and visualize a real-world Git log dataset of the Linux kernel. With almost 700k commits and thousands of contributors (find out the exact number in this project ;-) ) there are many little data cleaning and wrangling challenges that you'll encounter. But you'll also gain insights into the contribution habits of the committers over the last 15 years.

For this project, you need to be familiar with the Pandas `DataFrame`, especially the `read_csv` and `groupby` functions, as well as with time series data.

#### Datasets used ####

We'll be using a plain Git log file of the [Linux kernel mirror on GitHub](https://github.com/torvalds/linux/).

#### Assumed student background ####

* Students will explore the `pd.read_csv()` function a little in this mini tutorial, using parameters that are very useful for processing header-less and compressed CSV files.
* Next, some basic data cleaning skills are necessary. The dataset contains wrong timestamps as well as missing values. This requires a simple `fillna()` and filtering based on the `DatetimeIndex`. There are also multiple author names for one person in the dataset. Students should spot and correct these data problems for the TOP 5 committers.
* Last, we `groupby()` the `DataFrame` by months and visualize the result via a `matplotlib` `bar` chart.

In general, this is an entry-level analysis that also includes information about how to get a Git log from your own repository.
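
The cleaning and grouping steps listed above can be sketched on a toy dataset. This is only an illustration under stated assumptions: the author names, the alias mapping, the `"unknown"` placeholder and the date bounds are all made up and are not taken from the real Git log.

```python
import pandas as pd

# Toy data standing in for the real log (all values are invented).
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "1970-01-01 00:00:01", "2017-06-05 10:00:00", "2017-06-20 12:30:00",
        "2017-07-01 08:15:00", "2037-04-25 08:08:26",
    ]),
    "author": ["Jane Doe", "J. Doe", None, "Jane Doe", "John Roe"],
})

# 1. Replace missing author names with a placeholder (hypothetical label).
log["author"] = log["author"].fillna("unknown")

# 2. Map alternative spellings to one canonical name (hypothetical mapping).
log["author"] = log["author"].replace({"J. Doe": "Jane Doe"})

# 3. Keep only a plausible time range by slicing the sorted DatetimeIndex.
log = log.set_index("timestamp").sort_index()
log = log.loc["2005":"2017"]  # hypothetical bounds

# 4. Count the commits per calendar month.
commits_per_month = log.groupby(log.index.to_period("M")).size()
print(commits_per_month)
```

If `matplotlib` is installed, `commits_per_month.plot.bar()` would turn the resulting `Series` into a bar chart.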

# 2. Project narrative intro

## 1. Read in a Git log file

Version control repositories like CVS, Subversion or Git can be a real gold mine for software developers. They contain every change to the source code including the date (the "when"), the responsible developers (the "who") as well as a little message that describes the intention (the "what") of a change.

In this DataCamp Project, we take the first steps to get some insights into the evolution of a very famous open-source project &ndash; the Linux kernel. Let's dive quickly into the life of a Linux kernel developer by reading the introduction of the developer's guide ([source](https://github.com/torvalds/linux/blob/master/Documentation/process/1.Intro.rst#what-this-document-is-about)):

<img style="float: right;margin:5px 20px 5px 1px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Tux.svg/204px-Tux.svg.png">

> The Linux kernel, at over 8 million lines of code and well over 1000 contributors to each release, is one of the largest and most active free software projects in existence. Since its humble beginning in 1991, this kernel has evolved into a best-of-breed operating system component which runs on pocket-sized digital music players, desktop PCs, the largest supercomputers in existence, and all types of systems in between. It is a robust, efficient, and scalable solution for almost any situation.

Linus Torvalds, the (spoiler alert!) main contributor to the Linux kernel (and also the creator of Git), created a [mirror of the Linux repository on GitHub](https://github.com/torvalds/linux/). It contains the complete history of kernel development for the last 15 years. We'll use this Git repository to get some first insights into the development efforts by

* identifying the TOP 10 committers and
* visualizing the commits over time.


**The Dataset**

I cloned the whole Git repository (~3.6 GB) from GitHub for you, exported the history of the relevant information into a text file with the command

```bash
git log --pretty="%at#%aN#%aE" > git_log_basic.log
```  

and compressed it with `bzip2`. In the `./datasets` directory, you'll find the `bz2`-compressed version of this file named `git_log_basic.bz2`. It contains information about every single code contribution (a "commit") to the Linux kernel over the last 15 years.

The entries of the (header-less) dataset look like this:

```
1501675595#Yoshihiro Shimoda#yoshihiro.shimoda.uh@renesas.com
1500464230#Manu Gautam#mgautam@codeaurora.org
1500945088#Shawn Lin#shawn.lin@rock-chips.com
1501077766#Ludovic Desroches#ludovic.desroches@microchip.com
1499173119#Andy Shevchenko#andriy.shevchenko@linux.intel.com
```

Each line contains some basic information about a commit:
* `timestamp`: the time of the commit as a UNIX timestamp in seconds (Git log placeholder "`%at`")
* `author`: the name of the author that performed the commit (Git log placeholder "`%aN`")
* `email`: the author's email address (Git log placeholder "`%aE`")

The columns are separated by the number sign `#`.

Let's read in the Git log file with Pandas and the hints from above!

In [1]:
import pandas as pd

FILE_PATH = "datasets/git_log_basic.bz2"

log_raw = pd.read_csv(
    FILE_PATH,
    sep="#",
    compression="bz2", #optional
    header=None,
    names=['timestamp', 'author', 'email']
)

log_raw.head()

| | timestamp | author | email |
|---|---|---|---|
| 0 | 1502826583 | Linus Torvalds | torvalds@linux-foundation.org |
| 1 | 1502741399 | Linus Torvalds | torvalds@linux-foundation.org |
| 2 | 1502735756 | Linus Torvalds | torvalds@linux-foundation.org |
| 3 | 1502665292 | Linus Torvalds | torvalds@linux-foundation.org |
| 4 | 1502663668 | Linus Torvalds | torvalds@linux-foundation.org |
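
The `compression` argument is marked as optional in the cell above because `read_csv` defaults to `compression="infer"`, which picks the decompression method from the file extension. The following minimal sketch illustrates this with two of the sample rows shown earlier; the temporary file name `sample_log.bz2` is hypothetical and only its `.bz2` extension matters.

```python
import bz2
import os
import tempfile

import pandas as pd

# Sample rows in the same "#"-separated, header-less layout as the dataset.
sample = (
    "1501675595#Yoshihiro Shimoda#yoshihiro.shimoda.uh@renesas.com\n"
    "1500464230#Manu Gautam#mgautam@codeaurora.org\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the ".bz2" extension drives the inference.
    path = os.path.join(tmp_dir, "sample_log.bz2")
    with bz2.open(path, "wt", encoding="utf-8") as f:
        f.write(sample)

    # No `compression` argument is passed: pandas infers bz2 from the extension.
    sample_log = pd.read_csv(
        path,
        sep="#",
        header=None,
        names=["timestamp", "author", "email"],
    )

print(sample_log)
```

Passing `compression="bz2"` explicitly is still useful when a file does not carry a telling extension.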


## 2. Get familiar with the data

Let's get familiar with the data we are talking about by going through some metrics.

In [2]:
# How many commits are we looking at?
print("\n--- # commits ---")
print(len(log_raw))


# Which are the TOP 10 authors of the (still dirty) dataset?
print("\n--- TOP 10 committers ---")
print(log_raw['author'].value_counts().head(10))


# How many authors contributed to the Linux kernel so far?
print("\n--- # contributors ---")
print(len(log_raw['author'].value_counts()))


--- # commits ---
692885

--- TOP 10 committers ---
Linus Torvalds           23361
David S. Miller           8994
Mark Brown                6796
Takashi Iwai              6206
Al Viro                   5993
H Hartley Sweeten         5931
Ingo Molnar               5324
Mauro Carvalho Chehab     5172
Arnd Bergmann             4818
Greg Kroah-Hartman        4556
Name: author, dtype: int64

--- # contributors ---
17258
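
A note on how these metrics behave on dirty data: `value_counts()` skips missing values and treats every spelling of a name as a separate author, so the contributor count above is only an approximation. The sketch below shows this on a made-up `Series` (the names are invented) and also uses `nunique()`, which yields the same number as `len(...value_counts())`.

```python
import pandas as pd

# Invented author column with a missing value and a spelling variant.
authors = pd.Series(
    ["Jane Doe", "John Roe", "Jane Doe", None, "jane doe"], name="author"
)

counts = authors.value_counts()        # missing values are not counted
n_commits = len(authors)               # every row counts as a commit
n_contributors = authors.nunique()     # distinct non-missing names

print(counts)
print(n_commits, n_contributors)
```

Here "Jane Doe" and "jane doe" are counted as two contributors, which is the kind of problem the later cleaning steps address.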


## 3. Wrangling the data

We've got some impressive numbers, haven't we? Did you notice how fast your computer processed the data? That's number crunching at its best :-)


OK, let's prepare our `DataFrame` for some time series analysis. Rearrange the existing `log_raw` `DataFrame` into a `log_timed` `DataFrame` that uses the `timestamp` data (which has "seconds" as unit) as `DatetimeIndex`. Take a look at the results of `index`'s `summary()`, too. Is there anything odd?

In [3]:
log_raw['timestamp'] = pd.to_datetime(log_raw['timestamp'], unit="s")
log_timed = log_raw.set_index('timestamp').sort_index()
log_timed.index.summary()

'DatetimeIndex: 692885 entries, 1970-01-01 00:00:01 to 2037-04-25 08:08:26'
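
As far as I know, `Index.summary()` is not available in every pandas version (it was deprecated and later removed from the public API, so treat its availability as an assumption). A minimal sketch of an equivalent check that relies only on `min()` and `max()` of the index, using made-up rows:

```python
import pandas as pd

# Invented rows; the timestamp 1 imitates a commit with a broken clock.
raw = pd.DataFrame({
    "timestamp": [1501675595, 1, 1500464230],
    "author": ["Jane Doe", "John Roe", "Jane Doe"],
})

# Interpret the integers as seconds since the UNIX epoch.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], unit="s")
timed = raw.set_index("timestamp").sort_index()

# Report the number of entries and the covered time range.
first, last = timed.index.min(), timed.index.max()
print(f"{len(timed)} entries, {first} to {last}")
```

A first entry right at the start of the UNIX epoch is a hint that some timestamps in the log cannot be trusted.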

*Stop here! Only the first three tasks :)*