# Pandas basics
This Cheatbook guides you through basic operations and mechanics of pandas.

## References
* [Pandas' website](https://pandas.pydata.org/)
* [Pandas' user guide](https://pandas.pydata.org/docs/user_guide/index.html)

First, we import pandas under the abbreviation `pd`, which saves us some keystrokes.

In [1]:
import pandas as pd

## Main pandas data structures

One base data type in pandas is the Series, which corresponds to a column. It is a list of observations (= entries) for one variable (= characteristic).

In [2]:
series = pd.Series(["A", "B", "C", "D"])
series

0    A
1    B
2    C
3    D
dtype: object

We can give the Series a name.

In [3]:
series.name = "letter"
series

0    A
1    B
2    C
3    D
Name: letter, dtype: object

A Series has an index (in many cases just a list of numbers in ascending order), shown on the left side, which we can access as well.

In [4]:
index = series.index
index

RangeIndex(start=0, stop=4, step=1)

A DataFrame is a collection of Series. We can create a DataFrame from our Series.

In [5]:
dataframe = pd.DataFrame(series)
dataframe

   letter
0       A
1       B
2       C
3       D


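We can add additional Series. They are aligned to the DataFrame by index, so rows without a matching entry are filled with `NaN`, and the integer values are shown as floats because of that.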

In [6]:
dataframe['a'] = pd.Series([1,2,3])
dataframe['b'] = pd.Series([4,4,4])
dataframe

   letter    a    b
0       A  1.0  4.0
1       B  2.0  4.0
2       C  3.0  4.0
3       D  NaN  NaN
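The following sketch illustrates the index alignment that happens when a Series is assigned as a new column. The names `frame`, `c` and `missing_in_a` are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame({"letter": ["A", "B", "C", "D"]})

# The new Series only has the labels 0, 1 and 2.
# Label 3 finds no partner during alignment and becomes NaN.
frame["a"] = pd.Series([1, 2, 3])

# A plain list is not aligned by label; it needs exactly one value per row.
frame["c"] = [1, 2, 3, 4]

missing_in_a = frame["a"].isna().tolist()
print(frame)
```

Because `NaN` is a floating point value, column `a` ends up as `float64`, while column `c` keeps its integer type.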


### Accessing values

Accessing a Series

In [7]:
dataframe['letter']

0    A
1    B
2    C
3    D
Name: letter, dtype: object

Accessing multiple Series.

In [8]:
dataframe[['a', 'b']]

     a    b
0  1.0  4.0
1  2.0  4.0
2  3.0  4.0
3  NaN  NaN


Accessing a row by its integer position.

In [9]:
dataframe.iloc[0]

letter    A
a         1
b         4
Name: 0, dtype: object

Accessing multiple rows by their index labels.

In [10]:
dataframe.loc[[0,2]]

   letter    a    b
0       A  1.0  4.0
2       C  3.0  4.0
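With the default index, labels and positions happen to be identical, which hides the difference between `loc` and `iloc`. A minimal sketch with a deliberately reordered index (the name `shuffled` is an assumption for this example) makes it visible:

```python
import pandas as pd

# The index labels 2, 0, 1 no longer match the row positions 0, 1, 2.
shuffled = pd.DataFrame({"letter": ["A", "B", "C"]}, index=[2, 0, 1])

by_label = shuffled.loc[0, "letter"]   # the row whose index label is 0
by_position = shuffled.iloc[0, 0]      # the first row, whatever its label is

print(by_label, by_position)
```

Here `loc` returns the value from the second row, while `iloc` returns the value from the first row.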


Displaying only the first (two) rows of a DataFrame.

In [11]:
dataframe.head(2)

   letter    a    b
0       A  1.0  4.0
1       B  2.0  4.0


## Working with values

We can use basic Python-like operations on the Series in a DataFrame.

In [12]:
dataframe['a'] * dataframe['b']

0     4.0
1     8.0
2    12.0
3     NaN
dtype: float64

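If we convert a Series to a matching data type first, we can also combine Series of different types. Here, the numbers are converted to strings and appended to the letters. Note that the missing value becomes the string `nan`.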

In [13]:
dataframe['letter'] + dataframe['a'].astype(str)

0    A1.0
1    B2.0
2    C3.0
3    Dnan
dtype: object
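If the trailing `.0` and the `nan` text are not wanted, one possible approach is to format each number explicitly. This is only a sketch: the helper `as_label` is hypothetical, and it assumes that the non-missing values are whole numbers.

```python
import pandas as pd

frame = pd.DataFrame({
    "letter": ["A", "B", "C", "D"],
    "a": [1.0, 2.0, 3.0, None],
})

def as_label(value):
    """Format a whole number without decimals; missing values become an empty string."""
    return "" if pd.isna(value) else str(int(value))

labels = frame["letter"] + frame["a"].map(as_label)
print(labels.tolist())
```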

On string values, we can apply various methods.

In [14]:
# this code is just for demonstration purposes and not needed in an analysis
[x for x in dir(dataframe['letter'].str) if not x.startswith("_")]

['capitalize',
 'casefold',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'slice',
 'slice_replace',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'wrap',
 'zfill']

E.g. we could set the values in the `letter` Series to lower case with `lower()`.

In [15]:
dataframe['letter'].str.lower()

0    a
1    b
2    c
3    d
Name: letter, dtype: object
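A few more of the string methods listed above, sketched on a small example Series (the variable names are chosen for illustration):

```python
import pandas as pd

names = pd.Series(["Kevin", "Phillip", "Mike"])

# Each method is applied to every entry and returns a new Series.
lengths = names.str.len()
starts_with_k = names.str.startswith("K")
shouted = names.str.upper()

print(lengths.tolist())
print(starts_with_k.tolist())
print(shouted.tolist())
```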

## Converting data


In [16]:
data = pd.DataFrame(
    ["1", 10, None, "3.0", "hi"],
    columns=["num"])
data

    num
0     1
1    10
2  None
3   3.0
4    hi


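Converting data to numeric values. With `errors="coerce"`, entries that cannot be parsed as a number become `NaN` instead of raising an error.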

In [17]:
number_data = pd.to_numeric(data['num'], errors="coerce")
number_data

0     1.0
1    10.0
2     NaN
3     3.0
4     NaN
Name: num, dtype: float64
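For comparison, a short sketch of what happens without `errors="coerce"`, and of one way to find the entries that were lost during coercion. The names `raw`, `failed` and `unparseable` are assumptions for this example.

```python
import pandas as pd

raw = pd.Series(["1", 10, None, "3.0", "hi"])

# By default (errors="raise"), an unparseable entry aborts the conversion.
try:
    pd.to_numeric(raw)
    failed = False
except ValueError:
    failed = True

# With coercion, we can compare against the raw data to see
# which entries were present but could not be parsed.
coerced = pd.to_numeric(raw, errors="coerce")
unparseable = raw[coerced.isna() & raw.notna()]

print(failed)
print(unparseable.tolist())
```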

## Filtering data

### Removing nulls

With `dropna()`, we remove any entries that have `NaN` values.

In [18]:
numbers = number_data.dropna()
numbers

0     1.0
1    10.0
3     3.0
Name: num, dtype: float64
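`dropna()` is also available on DataFrames, where it removes whole rows. A minimal sketch with made-up data, including the `subset` parameter that restricts the check to certain columns:

```python
import pandas as pd

frame = pd.DataFrame({
    "letter": ["A", "B", "C", "D"],
    "a": [1.0, 2.0, None, 4.0],
    "b": [None, 5.0, 6.0, 7.0],
})

# Drop every row that has a missing value in any column.
complete_rows = frame.dropna()

# Drop only the rows where column "a" is missing.
rows_with_a = frame.dropna(subset=["a"])

print(complete_rows.index.tolist())
print(rows_with_a.index.tolist())
```

The original index labels are kept, so gaps in the index show which rows were removed.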

### Selecting data

We can create a condition that yields a boolean Series, which is `True` for every entry that fulfills the condition.

In [19]:
is_one_digit = (numbers < 10) & (numbers >= 0)
is_one_digit

0     True
1    False
3     True
Name: num, dtype: bool

We can then take this result as input for selecting data that fulfills that condition.

In [20]:
one_digit_numbers = numbers[is_one_digit]
one_digit_numbers

0    1.0
3    3.0
Name: num, dtype: float64

We can also negate the condition with the `~` sign.

In [21]:
other_numbers = numbers[~is_one_digit]
other_numbers

1    10.0
Name: num, dtype: float64
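Conditions can be combined with `&` (and) and `|` (or). A short sketch with example values; note that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `<` or `>=` in Python.

```python
import pandas as pd

numbers = pd.Series([1.0, 10.0, 3.0, -2.0, 25.0])

# Entries that are negative OR have at least two digits.
outside = numbers[(numbers < 0) | (numbers >= 10)]

# isin() checks membership in a list of allowed values.
chosen = numbers[numbers.isin([1.0, 3.0])]

print(outside.tolist())
print(chosen.tolist())
```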

## Stacking and unstacking data

We can reshape data between a wide form (values side by side in columns) and a long form (values on top of each other).

In [22]:
teams = pd.DataFrame({
    "team" : ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"]
    })
teams

   team     name
0     A    Kevin
1     A  Phillip
2     B     Mike


`stack()` puts the data of the various columns on top of each other. The column names become an inner level of the index.

In [23]:
teams.stack()

0  team          A
   name      Kevin
1  team          A
   name    Phillip
2  team          B
   name       Mike
dtype: object

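`unstack()` "pulls apart" the DataFrame. For a DataFrame with a plain index, the result is a Series that is indexed first by column name and then by row label.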

In [24]:
teams.unstack()

team  0          A
      1          A
      2          B
name  0      Kevin
      1    Phillip
      2       Mike
dtype: object

We can unstack multiple times until everything is widened.

In [25]:
teams.unstack().unstack()

          0        1     2
team      A        A     B
name  Kevin  Phillip  Mike
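`stack()` and `unstack()` are inverse operations in the simple case shown here. The following sketch rebuilds the example data and checks the round trip; the names `stacked` and `restored` are assumptions for this example.

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"],
})

# Long form: one entry per (row label, column name) pair.
stacked = teams.stack()

# A single entry can be addressed with both index levels.
second_name = stacked.loc[(1, "name")]

# Moving the inner index level back into columns restores the wide form.
restored = stacked.unstack()

print(second_name)
print(restored)
```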


## Pivoting data

Let's say we have more data in our DataFrame.

In [26]:
teams['working_hours'] = [30,20,40]
teams.head()

   team     name  working_hours
0     A    Kevin             30
1     A  Phillip             20
2     B     Mike             40


With `pivot_table`, we can transform the data to get another view on it.

In [27]:
hours_per_team = teams.pivot_table(
    values="working_hours", index="name", columns="team", fill_value=0)
hours_per_team

team      A   B
name
Kevin    30   0
Mike      0  40
Phillip  20   0


This is helpful if we want to calculate results for specific characteristics of our data.

In [28]:
hours_per_team.sum()

team
A    50
B    40
dtype: int64
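If only the totals per team are of interest, the aggregation can also be done in one step. This sketch shows two possible ways, `pivot_table` with `aggfunc="sum"` and `groupby`, on the same example data; the names `by_pivot` and `by_group` are made up here.

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"],
    "working_hours": [30, 20, 40],
})

# pivot_table aggregates all rows that share the same index value.
by_pivot = teams.pivot_table(values="working_hours", index="team", aggfunc="sum")

# groupby expresses the same calculation and returns a Series.
by_group = teams.groupby("team")["working_hours"].sum()

print(by_pivot)
print(by_group)
```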

## Reading data

Pandas can read various data sources. A common format is CSV (comma-separated values). The `read_csv` function can read this data.

In [29]:
changes = pd.read_csv("datasets/change_history.csv")
changes.head()

             timestamp
0  2017-01-01 11:11:39
1  2017-01-01 13:18:26
2  2017-01-01 16:01:37
3  2017-01-01 19:02:45
4  2017-01-01 20:47:01


## Saving data

Pandas can save DataFrames and Series in various ways, too.

In [30]:
changes.to_csv("/tmp/mychanges.csv")
pd.read_csv("/tmp/mychanges.csv").head()

   Unnamed: 0            timestamp
0           0  2017-01-01 11:11:39
1           1  2017-01-01 13:18:26
2           2  2017-01-01 16:01:37
3           3  2017-01-01 19:02:45
4           4  2017-01-01 20:47:01


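By default, `to_csv` also writes the index, which comes back as an additional `Unnamed: 0` column when the file is read again. Setting the `index` parameter to `False` avoids exporting the index.

In [31]:
changes.to_csv("/tmp/mychanges.csv", index=False)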
pd.read_csv("/tmp/mychanges.csv").head()

             timestamp
0  2017-01-01 11:11:39
1  2017-01-01 13:18:26
2  2017-01-01 16:01:37
3  2017-01-01 19:02:45
4  2017-01-01 20:47:01
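The round trip can also be tried without touching the file system. The sketch below writes to an in-memory buffer instead of a file and uses two made-up timestamp rows; `parse_dates` additionally asks `read_csv` to convert the column to real datetime values.

```python
import io

import pandas as pd

changes = pd.DataFrame({
    "timestamp": ["2017-01-01 11:11:39", "2017-01-01 13:18:26"],
})

# Write the CSV text into a buffer that behaves like a file.
buffer = io.StringIO()
changes.to_csv(buffer, index=False)

# Rewind the buffer and read the data back, parsing the timestamps.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["timestamp"])

print(restored.dtypes)
print(restored["timestamp"].dt.hour.tolist())
```

Without `parse_dates`, the `timestamp` column would be read as plain text.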


## Joining data

Datasets can also be joined. Let's take a look at this, too.

In [32]:
commits = pd.DataFrame({
    "commit_id" : ["twq3", "23ae", "aead", "hqd2", "fg3d"],
    "author": ["Kevin", "Phillip", "Mike", "Kevin", "Mike"]})
commits

   commit_id   author
0       twq3    Kevin
1       23ae  Phillip
2       aead     Mike
3       hqd2    Kevin
4       fg3d     Mike


For the data that we want to join, we set the Series that matches the information in the other DataFrame (here: `name`) as the index.

In [33]:
name_teams = teams.set_index("name")
name_teams

         team  working_hours
name
Kevin       A             30
Phillip     A             20
Mike        B             40


We can then join the former dataset with the other one by using the column that holds the same values as the index (here: `author`).

In [34]:
teams_commits = commits.join(name_teams, on="author")
teams_commits

   commit_id   author  team  working_hours
0       twq3    Kevin     A             30
1       23ae  Phillip     A             20
2       aead     Mike     B             40
3       hqd2    Kevin     A             30
4       fg3d     Mike     B             40
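An alternative that does not require setting an index first is `merge`. The following is a sketch on similar data; the commit `zz00` and the author `Dana` are invented here to show what happens when no matching team entry exists.

```python
import pandas as pd

commits = pd.DataFrame({
    "commit_id": ["twq3", "23ae", "zz00"],
    "author": ["Kevin", "Phillip", "Dana"],  # "Dana" is a hypothetical author without a team
})
teams = pd.DataFrame({
    "name": ["Kevin", "Phillip", "Mike"],
    "team": ["A", "A", "B"],
})

# how="left" keeps every commit, even if the author is unknown in `teams`.
merged = commits.merge(teams, how="left", left_on="author", right_on="name")

print(merged)
```

Commits without a matching entry get `NaN` in the columns that come from `teams`.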


## Summary
I hope you find this useful! Have fun with Pandas and take a look at the other Cheatbooks as well!