# Basic data handling

First, we import pandas as `pd` for short.

In [1]:
import pandas as pd

One basic data type in pandas is the Series (a kind of indexed list of values).

## Main pandas data structures

In [2]:
series = pd.Series(["A","B", "C"])
series

0    A
1    B
2    C
dtype: object

A Series has an index (in many cases just a list of numbers in ascending order).

In [3]:
index = series.index
index

RangeIndex(start=0, stop=3, step=1)
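
The index does not have to be a range of numbers. As a minimal sketch, the labels `"x"`, `"y"`, `"z"` below are hypothetical and chosen only to show the difference between label-based and position-based access:

```python
import pandas as pd

# Hypothetical labels, used only for illustration.
labeled = pd.Series(["A", "B", "C"], index=["x", "y", "z"])

# .loc looks values up by index label.
value_at_y = labeled.loc["y"]

# .iloc looks values up by position and ignores the labels.
value_at_pos_0 = labeled.iloc[0]

print(list(labeled.index))
print(value_at_y, value_at_pos_0)
```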

A DataFrame is a collection of Series that share an index. We can create a DataFrame from a Series.

In [4]:
dataframe = pd.DataFrame(series, columns=["char"])
dataframe

Unnamed: 0,char
0,A
1,B
2,C
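
To illustrate the "collection of Series" idea, here is a small sketch that builds a DataFrame from a dictionary of Series and reads one column back. The variable names are assumptions made for this example:

```python
import pandas as pd

chars = pd.Series(["A", "B", "C"])
nums = pd.Series([1, 2, 3])

# Each dictionary key becomes a column name, each Series a column.
frame = pd.DataFrame({"char": chars, "num": nums})

# Selecting a single column gives a Series again.
column = frame["char"]

print(type(column).__name__)
print(frame.shape)
```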


We can add an additional Series like this.

In [5]:
dataframe['num'] = pd.Series([1,2,3])
dataframe

Unnamed: 0,char,num
0,A,1
1,B,2
2,C,3
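
When a Series is assigned as a new column, pandas matches values by index label, not by position. The following sketch uses a deliberately shifted index (an assumption for illustration) to show what happens when the labels do not line up:

```python
import pandas as pd

frame = pd.DataFrame({"char": ["A", "B", "C"]})  # index labels 0, 1, 2

# This Series carries labels 1, 2, 3 instead of 0, 1, 2.
shifted = pd.Series([10, 20, 30], index=[1, 2, 3])

# Assignment aligns on labels: row 0 has no match and becomes NaN,
# and the value at label 3 is dropped because the frame has no such row.
frame["num"] = shifted

print(frame)
```

If the values should be taken purely by position, passing a plain list (as long as the frame) avoids the alignment step.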


For string values, a Series offers various methods through its `.str` accessor.

In [6]:
# this code is just for demonstration purposes and not needed in an analysis
[x for x in dir(dataframe['char'].str) if not x.startswith("_")]

['capitalize',
 'casefold',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'slice',
 'slice_replace',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'wrap',
 'zfill']

For example, `lower()` returns the values of the `char` series in lower case. The original DataFrame is not modified.

In [7]:
dataframe['char'].str.lower()

0    a
1    b
2    c
Name: char, dtype: object
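
Because the `.str` methods return a new Series, the result has to be assigned if it should be kept. A minimal sketch, where the column name `char_lower` is an assumption for this example:

```python
import pandas as pd

frame = pd.DataFrame({"char": ["A", "B", "C"]})

# .str.lower() returns a new Series; the "char" column stays unchanged.
lowered = frame["char"].str.lower()

# Store the result in a new column to keep it.
frame["char_lower"] = lowered

print(frame)
```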

## Converting data


In [8]:
data = pd.DataFrame(["1", 10, None, "3.0", "hi"], columns=["num"])
data

Unnamed: 0,num
0,1
1,10
2,
3,3.0
4,hi


In [9]:
number_data = pd.to_numeric(data['num'], errors="coerce")
number_data

0     1.0
1    10.0
2     NaN
3     3.0
4     NaN
Name: num, dtype: float64
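
With `errors="coerce"`, values that cannot be parsed silently become `NaN`. The sketch below contrasts this with `errors="raise"` and shows one possible way to find the entries that were lost in the conversion. The variable names are assumptions:

```python
import pandas as pd

raw = pd.Series(["1", 10, None, "3.0", "hi"])

# Unparseable values are replaced by NaN.
coerced = pd.to_numeric(raw, errors="coerce")

# With errors="raise", the same input is rejected with a ValueError.
try:
    pd.to_numeric(raw, errors="raise")
    strict_failed = False
except ValueError:
    strict_failed = True

# Entries that were present in the raw data but are NaN after conversion.
unparseable = raw[coerced.isna() & raw.notna()]

print(strict_failed)
print(unparseable)
```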

## Filtering data

### Removing nulls

In [10]:
numbers = number_data.dropna()
numbers

0     1.0
1    10.0
3     3.0
Name: num, dtype: float64

### Selecting data

In [11]:
is_one_digit = (numbers < 10) & (numbers >= 0)
is_one_digit

0     True
1    False
3     True
Name: num, dtype: bool

In [12]:
one_digit_numbers = numbers[is_one_digit]
one_digit_numbers

0    1.0
3    3.0
Name: num, dtype: float64

In [26]:
other_numbers = numbers[~is_one_digit]
other_numbers

1    10.0
Name: num, dtype: float64
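
The boolean mask above combines two comparisons with `&`; the parentheses are needed because `&` binds more tightly than `<` and `>=`. As an alternative sketch, `Series.between` can express the same range check, assuming a pandas version in which `between` accepts `inclusive="left"`:

```python
import pandas as pd

numbers = pd.Series([1.0, 10.0, 3.0], index=[0, 1, 3])

# Two comparisons combined element-wise with &.
mask_ops = (numbers < 10) & (numbers >= 0)

# The same half-open range 0 <= x < 10, written with between().
mask_between = numbers.between(0, 10, inclusive="left")

print(mask_ops.equals(mask_between))
print(numbers[mask_between])
```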

## Stacking and unstacking data

In [13]:
teams = pd.DataFrame({
    "team" : ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"]
    })
teams

Unnamed: 0,team,name
0,A,Kevin
1,A,Phillip
2,B,Mike


In [14]:
teams.stack()

0  team          A
   name      Kevin
1  team          A
   name    Phillip
2  team          B
   name       Mike
dtype: object

In [15]:
teams.unstack()

team  0          A
      1          A
      2          B
name  0      Kevin
      1    Phillip
      2       Mike
dtype: object

In [16]:
teams.unstack().unstack()

Unnamed: 0,0,1,2
team,A,A,B
name,Kevin,Phillip,Mike
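
The stacked result is a Series with a two-level index of (row label, column name). A short sketch of how a single value can be addressed in that form, and how `unstack()` turns the stacked Series back into a table:

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"],
})

# stack() moves the column names into an inner index level.
stacked = teams.stack()

# A value is addressed by the pair (row label, column name).
second_name = stacked.loc[(1, "name")]

# unstack() moves the inner index level back into columns.
restored = stacked.unstack()

print(second_name)
print(restored)
```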


## Pivoting data

In [17]:
teams['working_hours'] = [30,20,40]
teams.head()

Unnamed: 0,team,name,working_hours
0,A,Kevin,30
1,A,Phillip,20
2,B,Mike,40


In [18]:
hours_per_team = teams.pivot_table(values="working_hours", index="name", columns="team", fill_value=0)
hours_per_team

team,A,B
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Kevin,30,0
Mike,0,40
Phillip,20,0


In [19]:
hours_per_team.sum()

team
A    50
B    40
dtype: int64
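
`pivot_table` aggregates with the mean unless told otherwise; here every name and team pair occurs only once, so the mean equals the value itself. The sketch below passes `aggfunc="sum"` explicitly and compares the column totals with a `groupby`, which is one alternative way to get hours per team:

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"],
    "working_hours": [30, 20, 40],
})

# Same pivot as above, with the aggregation function made explicit.
hours = teams.pivot_table(
    values="working_hours", index="name", columns="team",
    aggfunc="sum", fill_value=0,
)
totals_pivot = hours.sum()

# Alternative: group by team and sum the hours directly.
totals_groupby = teams.groupby("team")["working_hours"].sum()

print(totals_pivot)
print(totals_groupby)
```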

## Reading data

In [20]:
changes = pd.read_csv("datasets/change_history.csv")
changes.head()

Unnamed: 0,timestamp
0,2017-01-01 11:11:39
1,2017-01-01 13:18:26
2,2017-01-01 16:01:37
3,2017-01-01 19:02:45
4,2017-01-01 20:47:01
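
By default, `read_csv` loads the `timestamp` column as plain strings. As a sketch, the `parse_dates` parameter can convert it while reading. To keep the example self-contained, it reads from an in-memory string holding the first two rows shown above instead of the file on disk:

```python
import io

import pandas as pd

# Stand-in for the CSV file: header plus the first two rows shown above.
csv_text = "timestamp\n2017-01-01 11:11:39\n2017-01-01 13:18:26\n"

# parse_dates converts the listed columns to datetime values.
changes = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

# Datetime columns expose components such as the hour via .dt.
hours = changes["timestamp"].dt.hour.tolist()

print(changes.dtypes)
print(hours)
```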


## Saving data

In [21]:
changes.to_csv("/tmp/mychanges.csv")
pd.read_csv("/tmp/mychanges.csv").head()

Unnamed: 0.1,Unnamed: 0,timestamp
0,0,2017-01-01 11:11:39
1,1,2017-01-01 13:18:26
2,2,2017-01-01 16:01:37
3,3,2017-01-01 19:02:45
4,4,2017-01-01 20:47:01


In [22]:
changes.to_csv("/tmp/mychanges.csv", index=False)
pd.read_csv("/tmp/mychanges.csv").head()

Unnamed: 0,timestamp
0,2017-01-01 11:11:39
1,2017-01-01 13:18:26
2,2017-01-01 16:01:37
3,2017-01-01 19:02:45
4,2017-01-01 20:47:01
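
The extra `Unnamed: 0` column appears because `to_csv` writes the index as a first column without a header. The sketch below shows two ways to handle this: reading that column back as the index with `index_col=0`, or not writing it at all. It uses a temporary directory and a two-row stand-in for `changes`, both assumptions made so the example runs anywhere:

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the `changes` DataFrame.
changes = pd.DataFrame({
    "timestamp": ["2017-01-01 11:11:39", "2017-01-01 13:18:26"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mychanges.csv")

    # Option 1: write the index, then read the first column as the index.
    changes.to_csv(path)
    with_index_col = pd.read_csv(path, index_col=0)

    # Option 2: do not write the index in the first place.
    changes.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index_col.columns))
print(list(without_index.columns))
```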


## Joining data

In [23]:
commits = pd.DataFrame({
    "commit_id" : ["twq3", "23ae", "aead", "hqd2", "fg3d"],
    "author": ["Kevin", "Phillip", "Mike", "Kevin", "Mike"]})
commits

Unnamed: 0,commit_id,author
0,twq3,Kevin
1,23ae,Phillip
2,aead,Mike
3,hqd2,Kevin
4,fg3d,Mike


To join two DataFrames, we first set the index of the DataFrame we want to join in to the column whose values match a column in the other DataFrame. Here, `name` in `teams` matches `author` in `commits`.

In [24]:
name_teams = teams.set_index("name")
name_teams

Unnamed: 0_level_0,team,working_hours
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Kevin,A,30
Phillip,A,20
Mike,B,40


In [25]:
teams_commits = commits.join(name_teams, on="author")
teams_commits

Unnamed: 0,commit_id,author,team,working_hours
0,twq3,Kevin,A,30
1,23ae,Phillip,A,20
2,aead,Mike,B,40
3,hqd2,Kevin,A,30
4,fg3d,Mike,B,40
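
The same result can also be sketched with `merge`, which matches on columns directly and does not need `set_index` first. The extra commit by "Anna" is hypothetical and was added here only to show that a left join keeps rows without a matching team and fills them with `NaN`:

```python
import pandas as pd

teams = pd.DataFrame({
    "team": ["A", "A", "B"],
    "name": ["Kevin", "Phillip", "Mike"],
    "working_hours": [30, 20, 40],
})

commits = pd.DataFrame({
    # The last row (commit "zz00" by "Anna") is a made-up example entry.
    "commit_id": ["twq3", "23ae", "aead", "hqd2", "fg3d", "zz00"],
    "author": ["Kevin", "Phillip", "Mike", "Kevin", "Mike", "Anna"],
})

# how="left" keeps every commit, like DataFrame.join does by default.
merged = commits.merge(teams, left_on="author", right_on="name", how="left")

print(merged[["commit_id", "author", "team", "working_hours"]])
```

Unlike `join`, `merge` keeps both key columns (`author` and `name`) in the result, so one of them can be dropped afterwards if it is not needed.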
