# So you want to be a Pandas expert

The core concept of Pandas is the index, and index alignment.

## Why use Pandas?

Let's start with the obvious question: why Pandas in the first place?
Why do we even bother with it?
In my opinion, Pandas is a very one-dimensional tool: there is basically only one reason to use it (and there is a pun in there too).
If you think of all the different reasons you might have landed on learning Pandas, there are a lot that might come up, but they are not very good reasons.

You might want to use Pandas because you like the Python ```dict``` and ```list```, and you might want something even less convenient, and even more perplexing.

### Pandas Series

For instance, here is this thing called a Pandas series.

In [78]:
from pandas import Series, DataFrame
from numpy.random import default_rng

rng = default_rng(0)

s = Series(rng.integers(-10, +10, size=5))

print(s)

0    7
1    2
2    0
3   -5
4   -4
dtype: int64


It prints a little bit differently than a list, and we can iterate over its contents using ```items``` (called ```iteritems``` in older versions of Pandas):

In [7]:
for x in s.items(): print(f"{x=}")

x=(0, 7)
x=(1, 2)
x=(2, 0)
x=(3, -5)
x=(4, -4)


We get the numbering of the rows. I don't know why that's useful, so maybe we'll just unpack it into a variable that we don't use:

In [8]:
for _, x in s.items(): print(f"{x=}")

x=7
x=2
x=0
x=-5
x=-4


Or we can iterate over it directly, but I can't see what this is giving us over a list:

In [10]:
for x in s: print(f"{x=}")

x=7
x=2
x=0
x=-5
x=-4


Or, more likely, if we are using Pandas, we are more familiar with the Pandas DataFrame:

In [19]:
df = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])

In [20]:
df

Unnamed: 0,a,b,c
0,4,2,0
1,1,8,-5
2,6,3,-10
3,-3,7,1
4,-10,5,4


This looks kind of like a dictionary of Python lists, or maybe a bunch of lists stacked next to each other, or maybe some sort of matrix.

And, again, we can iterate over this thing, but then we get something really weird:

In [21]:
for x in df: print(f"{x=}")

x='a'
x='b'
x='c'


We get the column names. How is that useful to anybody?

So I guess we could try ```iterrows```:

In [22]:
for x in df.iterrows(): print(f"{x=}")

x=(0, a    4
b    2
c    0
Name: 0, dtype: int64)
x=(1, a    1
b    8
c   -5
Name: 1, dtype: int64)
x=(2, a     6
b     3
c   -10
Name: 2, dtype: int64)
x=(3, a   -3
b    7
c    1
Name: 3, dtype: int64)
x=(4, a   -10
b     5
c     4
Name: 4, dtype: int64)


But I have no idea what this is giving me, some row and each of the columns.

Or ```items``` (formerly ```iteritems```):

In [24]:
for x in df.items(): print(f"{x=}")

x=('a', 0     4
1     1
2     6
3    -3
4   -10
Name: a, dtype: int64)
x=('b', 0    2
1    8
2    3
3    7
4    5
Name: b, dtype: int64)
x=('c', 0     0
1    -5
2   -10
3     1
4     4
Name: c, dtype: int64)


That is kind of like a dictionary, right? This gives me like each of the columns independently. I guess I can throw away the column name and get the column:

In [25]:
for _, x in df.items(): print(f"{x=}")

x=0     4
1     1
2     6
3    -3
4   -10
Name: a, dtype: int64
x=0    2
1    8
2    3
3    7
4    5
Name: b, dtype: int64
x=0     0
1    -5
2   -10
3     1
4     4
Name: c, dtype: int64


I don't know what that is.

Or I can do ```itertuples```:

In [26]:
for x in df.itertuples(): print(f"{x=}")

x=Pandas(Index=0, a=4, b=2, c=0)
x=Pandas(Index=1, a=1, b=8, c=-5)
x=Pandas(Index=2, a=6, b=3, c=-10)
x=Pandas(Index=3, a=-3, b=7, c=1)
x=Pandas(Index=4, a=-10, b=5, c=4)


There I can start getting close to something I can deal with: tuples, lists, dictionaries, I understand those things.

Because, if you compare a Pandas DataFrame or Series to the built-in datatypes in Python (the list, the set, the dictionary), I'd say that, in terms of straightforwardness, the built-in types really win out.

Here we have a Python list, and it's just a bunch of numbers:

In [33]:
from random import seed; seed(0)
from random import choice, randint
from string import ascii_lowercase

In [41]:
xs = [randint(-10, +10) for _ in range(10)]

In [44]:
xs

[2, -1, 5, 1, 8, -4, 6, -6, -1, -6]

It is a structure we iterate over, and we can get the numbers and do something with them:

In [45]:
for x in xs: print(f"{x=}")

x=2
x=-1
x=5
x=1
x=8
x=-4
x=6
x=-6
x=-1
x=-6


We also have the dictionary, a set of key-value pairs:

In [50]:
d = {choice(ascii_lowercase): randint(-10, +10) for _ in range(10)}

In [51]:
d

{'r': 5, 'o': 6, 'i': -9, 'z': 7, 'a': 9, 'x': 2, 'w': 10, 'p': 0, 'h': 0}

We can also iterate over the keys:

In [54]:
for k in d: print(f"{k=}")

k='r'
k='o'
k='i'
k='z'
k='a'
k='x'
k='w'
k='p'
k='h'


Which we can also do explicitly:

In [56]:
for k in d.keys(): print(f"{k=}")

k='r'
k='o'
k='i'
k='z'
k='a'
k='x'
k='w'
k='p'
k='h'


We can also go over the individual values:

In [57]:
for v in d.values(): print(f"{v=}")

v=5
v=6
v=-9
v=7
v=9
v=2
v=10
v=0
v=0


We can also go over the pairing of the keys and the values; that also seems a straightforward and very simple API:

In [61]:
for k,v in d.items(): print(f"{k, v =}")

k, v =('r', 5)
k, v =('o', 6)
k, v =('i', -9)
k, v =('z', 7)
k, v =('a', 9)
k, v =('x', 2)
k, v =('w', 10)
k, v =('p', 0)
k, v =('h', 0)


So if we think of why we might want to use Pandas, it is not because of the strange naming of all these things, ```itertuples```, ```iterrows```. Maybe it's because of the really bizarre errors that come up when you use Pandas.

Here we have a Pandas Series, and it seems to be just some numeric values, and here we have a Pandas DataFrame:

In [80]:
from pandas import MultiIndex

In [87]:
rnd = default_rng(0)
s = Series(rnd.integers(-10, 10, size=5))
df1 = DataFrame(rnd.integers(-10, 10, size=(5,3)))
df2 = DataFrame(
    index=(idx := [1, 1, 2, 3, 4]),
    data=rng.integers(-10, 10, size=(len(idx), 3))
)
df3 = DataFrame(
    index=(idx := MultiIndex.from_product([[0], range(5)])),
    data=rng.integers(-10, 10, size=(len(idx), 3))
)

In [88]:
s

0    7
1    2
2    0
3   -5
4   -4
dtype: int64

In [89]:
df1

Unnamed: 0,0,1,2
0,-10,-9,-10
1,-7,6,2
2,8,0,2
3,9,4,2
4,0,1,8


We can multiply these, but we get a bunch of NaNs at the end, now is that useful?

In [92]:
s * df1

Unnamed: 0,0,1,2,3,4
0,-70,-18,0,,
1,-49,12,0,,
2,56,0,0,,
3,63,8,0,,
4,0,2,0,,


Matrix multiplication is not commutative, but here it doesn't make any difference:

In [93]:
df1 * s

Unnamed: 0,0,1,2,3,4
0,-70,-18,0,,
1,-49,12,0,,
2,56,0,0,,
3,63,8,0,,
4,0,2,0,,
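One way to read those NaNs, as a minimal sketch on made-up data: in an arithmetic operation between a Series and a DataFrame, Pandas by default matches the Series index against the DataFrame's column labels. The names `small_s` and `small_df` are illustrative and not part of the cells above.

```python
from pandas import DataFrame, Series

# Hypothetical toy data: a Series labelled 0..2, a DataFrame with columns 0..1
small_s = Series([10, 20, 30])                   # index: 0, 1, 2
small_df = DataFrame([[1, 2], [3, 4], [5, 6]])   # columns: 0, 1

# Default: the Series labels are lined up with the DataFrame's *columns*.
# Label 2 exists only in the Series, so it becomes an all-NaN column.
by_columns = small_s * small_df

# Explicitly line the Series labels up with the DataFrame's *rows* instead.
by_rows = small_df.mul(small_s, axis="index")

print(by_columns)
print(by_rows)
```

If lining up along the rows is what was intended, the `axis="index"` form keeps the result free of the extra NaN columns.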


Maybe I'll introduce another DataFrame into the story, and we'll see if anything else happens.

Here we have a different DataFrame, and if I add that to the first DataFrame, I get a bunch of NaNs at the beginning:

In [95]:
df2

Unnamed: 0,0,1,2
1,-8,1,4
1,6,0,-3
2,-4,-2,-1
3,4,7,-9
4,8,0,-3


In [94]:
df1 + df2

Unnamed: 0,0,1,2
0,,,
1,-15.0,7.0,6.0
1,-1.0,6.0,-1.0
2,4.0,-2.0,1.0
3,13.0,11.0,-7.0
4,8.0,1.0,5.0


That's not a big deal, because I can at least just drop the NaNs. I guess that's what I do all the time with Pandas, just drop NaNs, because they are popping up all over the place:

In [96]:
(df1 + df2).dropna()

Unnamed: 0,0,1,2
1,-15.0,7.0,6.0
1,-1.0,6.0,-1.0
2,4.0,-2.0,1.0
3,13.0,11.0,-7.0
4,8.0,1.0,5.0
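As a hedged aside, dropping is not the only option. The sketch below, on made-up frames named `left` and `right`, shows that the NaNs appear exactly where a row label exists on only one side, and that ```DataFrame.add``` with ```fill_value``` treats the missing side as a given value instead.

```python
from pandas import DataFrame

# Hypothetical toy frames whose row labels only partly overlap
left = DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
right = DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])

# Labels 0 and 3 have no partner on the other side, so they come out as NaN
plain = left + right

# Same addition, but a missing partner counts as 0
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```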


I can join these, and get a bunch of NaNs, I guess I can just drop them:

In [98]:
df1.join(df2, rsuffix="-df2")

Unnamed: 0,0,1,2,0-df2,1-df2,2-df2
0,-10,-9,-10,,,
1,-7,6,2,-8.0,1.0,4.0
1,-7,6,2,6.0,0.0,-3.0
2,8,0,2,-4.0,-2.0,-1.0
3,9,4,2,4.0,7.0,-9.0
4,0,1,8,8.0,0.0,-3.0


Let's introduce another DataFrame here, which doesn't look too dissimilar from the second DataFrame we were looking at. If we try to join it, we get an error, "cannot join with no overlapping index names":

In [101]:
df3

Unnamed: 0,Unnamed: 1,0,1,2
0,0,3,1,-5
0,1,-4,4,1
0,2,0,-4,5
0,3,-3,-4,7
0,4,-5,-6,4


In [99]:
df1.join(df3, rsuffix="-df3")

ValueError: cannot join with no overlapping index names

Wait, joining should be like stacking them next to each other, kind of like what the plus operation seems to be trying to do, so why is it telling me about non-overlapping index names? What the heck does that even mean? And if we are unlucky enough to ask one of our co-workers what this means and why Pandas is giving us this error, the answer will be: you just have to rename the axis to something on both sides, and then the join will work as you expect:

In [100]:
(
    df1
    .rename_axis("idx")
    .join(
        df3
        .rename_axis([..., "idx"]),
        rsuffix="-df3"
    )
)

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1,2,0-df3,1-df3,2-df3
Ellipsis,idx,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0,-10,-9,-10,3,1,-5
0,1,-7,6,2,-4,4,1
0,2,8,0,2,0,-4,5
0,3,9,4,2,-3,-4,7
0,4,0,1,8,-5,-6,4


And you say, what does that even mean, how is that helpful? That is total nonsense, rename axis? 
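A minimal sketch of what the renaming buys us, on made-up data: a flat index and a two-level index can be joined once one level name is shared, because the name tells Pandas which level to line up. The names `flat`, `nested`, `unnamed`, "grp" and "idx" are all illustrative.

```python
from pandas import DataFrame, MultiIndex

# Hypothetical frames: one with a flat index named "idx",
# one with a two-level index whose second level is also named "idx"
flat = DataFrame({"x": [1, 2, 3]}).rename_axis("idx")
nested = DataFrame(
    {"y": [10, 20, 30]},
    index=MultiIndex.from_product([["g"], range(3)], names=["grp", "idx"]),
)

# The shared level name "idx" tells the join which level to match on
joined = flat.join(nested)

# Without any names there is nothing to match on, and the join refuses
unnamed = DataFrame(
    {"y": [10, 20, 30]},
    index=MultiIndex.from_product([["g"], range(3)]),
)
try:
    DataFrame({"x": [1, 2, 3]}).join(unnamed)
    error = None
except ValueError as exc:
    error = str(exc)

print(joined)
print(error)
```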

#### List makes sense

If you compare that to built-in datatypes, the list just makes sense. Here we have a list called xs and a list called ys, and these represent some opaque collection of items that we iterate over.

In [118]:
xs = [randint(-10, +10) for _ in range(5)]
ys = [randint(-10, +10) for _ in range(5)]

If we add them together:

In [119]:
xs + ys

[-7, -7, -4, -2, -3, -1, 1, 4, 3, 7]

This represents a container level operation, it concatenates them.

If we want to use some special syntax for this, we can unpack these lists into a list literal, concatenating them in a different fashion:

In [120]:
[*xs, *ys]

[-7, -7, -4, -2, -3, -1, 1, 4, 3, 7]

If we want to actually add them up element by element, we line them up with a for loop: just zip them together and add up each pair. This just makes sense, there is no rename and no axis in this:

In [121]:
[x + y for x, y in zip(xs, ys)]

[-8, -6, 0, 1, 4]

#### Dictionary makes sense

A dictionary also makes a whole lot of sense.

Here we have two dictionaries with key-value pairs:

In [125]:
d1 = {choice(ascii_lowercase): randint(-10, +10) for _ in range(10)}
d2 = {choice(ascii_lowercase): randint(-10, +10) for _ in range(10)}

We can use the unpacking syntax to merge these dictionaries:

In [127]:
{**d1, **d2}

{'r': 4,
 't': 3,
 'b': -9,
 'v': -10,
 'm': 0,
 'g': -6,
 'w': 0,
 'h': -4,
 'd': 6,
 'z': -8,
 'p': -6,
 'j': 3,
 'l': 5,
 'i': 10,
 'o': 9,
 'q': 3,
 'x': -10}

You take all key-value pairs from each dictionary, and there is an order of preference if there happens to be an overlap.

If we happen to use Python 3.9 or later, we can do this with just the pipe syntax:

In [132]:
d1 | d2

{'r': 4,
 't': 3,
 'b': -9,
 'v': -10,
 'm': 0,
 'g': -6,
 'w': 0,
 'h': -4,
 'd': 6,
 'z': -8,
 'p': -6,
 'j': 3,
 'l': 5,
 'i': 10,
 'o': 9,
 'q': 3,
 'x': -10}

If we want to do an arithmetic operation, we can just do it: we take the keys of one and put them into a set, take the other's keys and put them into a set, find the set union of those, then look up the values and create a new dictionary with them (here ```d1 | d2``` merges the dictionaries, and iterating over the result visits the union of the keys). If a key is missing from one of the dictionaries, we can just substitute 0 by using the dictionary's ```get``` method:

In [133]:
{k: d1.get(k, 0) + d2.get(k, 0) for k in d1 | d2}

{'r': 4,
 't': 3,
 'b': -9,
 'v': -10,
 'm': 0,
 'g': -6,
 'w': -2,
 'h': -4,
 'd': 6,
 'z': -8,
 'p': -6,
 'j': 3,
 'l': 5,
 'i': 10,
 'o': 9,
 'q': 3,
 'x': -10}
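For comparison only, and as a sketch on made-up dictionaries named `left` and `right`, the same "line up by key, treat a missing key as 0" idea can be written by hand or with ```Series.add``` and ```fill_value```; both give the same pairing.

```python
from pandas import Series

# Hypothetical toy dictionaries with partly overlapping keys
left = {"a": 1, "b": 2, "c": 3}
right = {"b": 10, "c": 20, "d": 30}

# By hand: union of the keys, missing values default to 0
by_hand = {k: left.get(k, 0) + right.get(k, 0) for k in left.keys() | right.keys()}

# With Pandas: the dictionary keys become the index, and the addition
# pairs values by label; fill_value=0 plays the role of .get(k, 0)
aligned = Series(left).add(Series(right), fill_value=0)

print(by_hand)
print(aligned)
```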

#### ```collections.Counter``` makes sense

Even if we look in the Python standard library and look at the collection types it provides, they just make sense.

```collections.Counter``` totally makes sense. It is just kind of like a dictionary, except that it specifies that the values have to be some sort of integers or some sort of numeric values, some counts.

In [134]:
from collections import Counter

In [150]:
c1 = Counter({choice(ascii_lowercase): randint(-10, +10) for _ in range(10)})
c2 = Counter({choice(ascii_lowercase): randint(-10, +10) for _ in range(10)})

In [152]:
c1

Counter({'n': 8,
         'x': 5,
         'i': 4,
         'z': 2,
         'm': 1,
         'p': 0,
         'l': -2,
         'q': -9,
         'u': -9,
         't': -10})

In [153]:
c2

Counter({'h': 9,
         'w': 8,
         's': 6,
         'o': 3,
         'k': 2,
         'm': 1,
         'u': -1,
         't': -7,
         'd': -10})

We can add them together. If you have 4 "a"s in one and 5 in the other, you have 9 in the result, it makes sense:

In [154]:
c1 + c2

Counter({'h': 9,
         'n': 8,
         'w': 8,
         's': 6,
         'x': 5,
         'i': 4,
         'o': 3,
         'm': 2,
         'z': 2,
         'k': 2})

NOTE: adding two Counters keeps only the positive counts.

We might even get the intersection or the union of the counts:

In [151]:
c1 & c2

Counter({'m': 1})

In [155]:
c1 | c2

Counter({'h': 9,
         'n': 8,
         'w': 8,
         's': 6,
         'x': 5,
         'i': 4,
         'o': 3,
         'z': 2,
         'k': 2,
         'm': 1})

We might scratch our heads a little bit and think this is stretching it: what does that actually mean? But it actually makes sense. If you had two bags of things you have counted, you can wonder what is the minimum or the maximum count you can rely on having in either bag; these map to our intersection and union operators.
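The two-bags reading can be checked on a tiny made-up example (the names `bag1` and `bag2` are illustrative): ```&``` gives the per-key minimum and ```|``` the per-key maximum.

```python
from collections import Counter

# Hypothetical bags of counted fruit
bag1 = Counter(apples=3, pears=1)
bag2 = Counter(apples=2, plums=4)

at_least = bag1 & bag2   # per-key minimum: what both bags are guaranteed to have
at_most = bag1 | bag2    # per-key maximum: the most either bag has of each item
total = bag1 + bag2      # per-key sum

print(at_least, at_most, total, sep="\n")
```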

#### Is it because we constantly need to do ```.values``` to force it to do what we want?

That does not quite give us a good motivation as to why we might want to use Pandas. Maybe it's because Pandas is so frustrating that we want to use ```.values``` all over the place just to get back to, let's say, a numpy ```ndarray```, something we know how to deal with.

Here we have a Series with numerical values, and dataframes:

In [157]:
s = Series(rng.integers(-10, +10, size=5))

In [158]:
s

0     2
1   -10
2    -9
3    -3
4     6
dtype: int64

In [163]:
df1 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])
df2 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"def"])

In [164]:
df1

Unnamed: 0,a,b,c
0,8,3,7
1,-7,5,8
2,-10,-3,2
3,-8,0,2
4,5,8,-2


In [165]:
df2

Unnamed: 0,d,e,f
0,-2,-1,9
1,-7,-1,-10
2,-2,8,2
3,-4,9,2
4,8,-10,-1


If we multiply these, we get all NaNs:

In [166]:
s * df1

Unnamed: 0,a,b,c,0,1,2,3,4
0,,,,,,,,
1,,,,,,,,
2,,,,,,,,
3,,,,,,,,
4,,,,,,,,


If I ```dropna``` that is not going to be helpful to me at all!

Let's see what happens if I add the DataFrames together:

In [167]:
df1 + df2

Unnamed: 0,a,b,c,d,e,f
0,,,,,,
1,,,,,,
2,,,,,,
3,,,,,,
4,,,,,,


They seem to be of the same size, but when I add them together I get all NaNs as well.
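A hedged sketch of what is going on, using made-up frames named `abc` and `xyz`: the shapes match, but the column labels do not, and the addition pairs cells by label. Relabelling one side with ```set_axis``` is one way to say "these columns correspond" without leaving Pandas.

```python
from pandas import DataFrame

# Hypothetical frames with the same shape but different column labels
abc = DataFrame([[1, 2, 3], [4, 5, 6]], columns=[*"abc"])
xyz = DataFrame([[10, 20, 30], [40, 50, 60]], columns=[*"def"])

# No column label is shared, so every cell of the result is NaN
mismatched = abc + xyz

# Give the second frame the first frame's column labels, then add
relabelled = abc + xyz.set_axis(abc.columns, axis="columns")

print(mismatched)
print(relabelled)
```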

So let's just drop the Pandas away and do ```.values```:

In [170]:
df1.values + df2.values

array([[  6,   2,  16],
       [-14,   4,  -2],
       [-12,   5,   4],
       [-12,   9,   4],
       [ 13,  -2,  -3]])

If we do the same thing with the Series and a DataFrame, we get a broadcasting error:

In [172]:
s.values + df1.values

ValueError: operands could not be broadcast together with shapes (5,) (5,3) 

Even in the numpy universe our life isn't that easy. If we ask a helpful coworker, they will say you have to use a new axis: you have to add an axis to satisfy the broadcasting rules:

In [176]:
s.values[:, None] + df1.values

array([[ 10,   5,   9],
       [-17,  -5,  -2],
       [-19, -12,  -7],
       [-11,  -3,  -1],
       [ 11,  14,   4]])

I would reply that you are speaking a different language and I have no idea of what you are speaking about.
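For what it's worth, here is a small sketch of what the coworker means, on made-up arrays named `col` and `grid`. Numpy compares shapes from the right, so (5,) against (5, 3) fails, while (5, 1) against (5, 3) stretches the single column across all three.

```python
import numpy as np

col = np.array([1, 2, 3, 4, 5])        # shape (5,)
grid = np.arange(15).reshape(5, 3)     # shape (5, 3)

# Shapes are compared from the right: 5 against 3 does not match
try:
    col + grid
except ValueError:
    failed = True
else:
    failed = False

# Adding a trailing axis turns the shape (5,) into (5, 1)
tall = col[:, None]

# (5, 1) with (5, 3): the length-1 axis is stretched to 3
result = tall + grid

print(failed, tall.shape, result.shape)
print(result)
```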

If you manage to coerce this to work, and you ```.values``` your way to something that actually gives you the answer you want, but you still want to have that DataFrame for whatever reason, you can take the result and stick it back into a DataFrame. I guess that's not too bad:

In [177]:
DataFrame(df1.values + df2.values)

Unnamed: 0,0,1,2
0,6,2,16
1,-14,4,-2
2,-12,5,4
3,-12,9,4
4,13,-2,-3


Doesn't seem like a very powerful reason for us to use Pandas.

#### Or perhaps that we sometimes get totally perplexing results and have to ```.reset_index()``` to coerce the library to 

Maybe it is that, in addition to ```.values```, we have to ```.reset_index``` to kind of stay within Pandas and throw away whatever that index is, because that seems to be the source of our problems. I guess that might be a compelling reason to use Pandas: just to ```.reset_index()``` our way to success.

We have two DataFrames; we group the first one by its ```a``` column and then sum:

In [185]:
df1 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])
df2 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])

In [186]:
df1

Unnamed: 0,a,b,c
0,6,-8,8
1,9,-5,-2
2,0,3,-2
3,-8,8,3
4,-10,6,4


In [187]:
df2

Unnamed: 0,a,b,c
0,-7,2,0
1,-10,8,4
2,-4,-10,-9
3,5,-8,0
4,7,8,-5


In [188]:
df1.groupby("a").sum()

Unnamed: 0_level_0,b,c
a,Unnamed: 1_level_1,Unnamed: 2_level_1
-10,6,4
-8,8,3
0,3,-2
6,-8,8
9,-5,-2


This doesn't do much, but at least it will sort by the "a" values.

If we take that and we add it to another DataFrame:

In [189]:
df1.groupby("a").sum() + df2

Unnamed: 0,a,b,c
-10,,,
-8,,,
0,,5.0,-2.0
1,,,
2,,,
3,,,
4,,,
6,,,
9,,,


We end up again with a bunch of NaN values, except for a tiny little sliver there, but ```.dropna``` is not going to help.

You know what, I don't need to use ```.values```, I just use ```.reset_index``` on one side and get something that looks like what I want (in the ```.groupby``` I might be able to set a parameter to say don't use the index):

In [192]:
df1.groupby("a").sum().reset_index() + df2

Unnamed: 0,a,b,c
0,-17,8,4
1,-18,16,7
2,-4,-7,-11
3,11,-16,8
4,16,3,-7
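The ```.groupby``` parameter alluded to above is, I believe, ```as_index```. A minimal sketch on a made-up frame named `frame`:

```python
from pandas import DataFrame

# Hypothetical toy frame with a repeated key in column "a"
frame = DataFrame({"a": [2, 1, 2], "b": [10, 20, 30], "c": [1, 1, 1]})

# Default: the grouping key becomes the index of the result
keyed = frame.groupby("a").sum()

# as_index=False: the key stays an ordinary column, the index is 0, 1, ...
flat = frame.groupby("a", as_index=False).sum()

print(keyed)
print(flat)
```

In this sketch `flat` holds the same values as `keyed.reset_index()`, so the extra call is not needed.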


#### How many rows do we get (Good tools should be predictable)

Maybe it's because Pandas is so unpredictable in terms of what it will give us when we perform an operation. A good tool should be guessable: we should be able to write some code and have a sense of what it's going to do before we run that line of code.

That doesn't seem to be the case with Pandas, because even with something simple, like figuring out how many rows I am going to get when doing an operation with two Pandas objects, I can't guess what the answer is going to be.

Like when I have these two Series, which just contain 4 values, and I sum them up, ok, the result has 4 values, that's not too bad:

In [200]:
s1 = Series(rng.integers(-10, +10, size=4), index=[*"aabb"])
s2 = Series(rng.integers(-10, +10, size=4), index=[*"aabb"])

In [201]:
print(s1, s2, sep="\n\n")

a   -2
a    7
b    9
b   -8
dtype: int64

a    1
a    5
b   -5
b   -5
dtype: int64


In [202]:
s1 + s2

a    -1
a    12
b     4
b   -13
dtype: int64

But when I try the same with these 2 other series, which also have 4 values, now I get 6 values. What?

In [203]:
s1 = Series(rng.integers(-10, +10, size=4), index=[*"aaab"])
s2 = Series(rng.integers(-10, +10, size=4), index=[*"abbb"])

In [205]:
print(s1, s2, sep="\n\n")

a   -6
a   -6
a    7
b   -6
dtype: int64

a   -6
b   -8
b   -8
b    5
dtype: int64


In [204]:
s1 + s2

a   -12
a   -12
a     1
b   -14
b   -14
b    -1
dtype: int64

Again, if we try adding these two other series, which also have 4 values, we get 8 values in the result this time:

In [206]:
s1 = Series(rng.integers(-10, +10, size=4), index=[*"aabb"])
s2 = Series(rng.integers(-10, +10, size=4), index=[*"abbb"])

In [207]:
print(s1, s2, s1 + s2, sep="\n\n")

a   -5
a    6
b    1
b    7
dtype: int64

a    1
b    5
b    6
b   -9
dtype: int64

a    -4
a     7
b     6
b     7
b    -8
b    12
b    13
b    -2
dtype: int64


I can't even predict how many values I am going to get when adding these two things. That would never happen with the Python dictionary or list: I can explicitly spell out how I want to combine the two structures, and I can see immediately from the code, which might be a little bit clumsy, what is going on. This is a pretty terrible reason to use Pandas, because we have very little predictability over what is going on.
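That said, the three row counts above (4, 6 and 8) do follow a pattern. The sketch below is my own reading of those outputs, not Pandas' actual implementation: identical indexes pair up position by position, and otherwise every label contributes (count on the left) times (count on the right) rows. The helpers `expected_rows` and `labelled` are hypothetical.

```python
from collections import Counter
from pandas import Series

def expected_rows(left, right):
    """Predict len(left + right) for two Series (a sketch, not pandas' own logic)."""
    if left.index.equals(right.index):
        return len(left)  # identical indexes: values pair up position by position
    lc, rc = Counter(left.index), Counter(right.index)
    return sum(
        # shared label: every left occurrence meets every right occurrence;
        # label on one side only: it is kept as many times as it occurs
        lc[k] * rc[k] if (k in lc and k in rc) else lc[k] + rc[k]
        for k in lc.keys() | rc.keys()
    )

def labelled(labels):
    """Build a throwaway Series whose index is the given string of labels."""
    return Series(range(len(labels)), index=[*labels])

# The three index pairs used in the cells above
cases = [("aabb", "aabb"), ("aaab", "abbb"), ("aabb", "abbb")]
predicted = [expected_rows(labelled(l), labelled(r)) for l, r in cases]
actual = [len(labelled(l) + labelled(r)) for l, r in cases]

print(predicted, actual)
```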

Maybe what I want to do is ```reset_index``` my way to success here, and then I end up with something that kind of makes sense, except for that column named "index", that is not very useful:

In [208]:
print(s1, s2, s1.reset_index() + s2.reset_index(), sep="\n\n")

a   -5
a    6
b    1
b    7
dtype: int64

a    1
b    5
b    6
b   -9
dtype: int64

  index   0
0    aa  -4
1    ab  11
2    bb   7
3    bb  -2


Instead of ```reset_index```, just throw the index in the garbage, not helpful at all, not really adding anything to my life:

In [209]:
print(s1, s2, s1.reset_index(drop=True) + s2.reset_index(drop=True), sep="\n\n")

a   -5
a    6
b    1
b    7
dtype: int64

a    1
b    5
b    6
b   -9
dtype: int64

0    -4
1    11
2     7
3    -2
dtype: int64


#### Is it because the API makes weird, incomprehensible distinctions, like the ```.groupby``` operations for user defined

Maybe the reason we use Pandas is because it makes weird, incomprehensible distinctions in its library.

For example, say we have a DataFrame and we want to ```.groupby``` a column that no longer holds unique values but only two values, True or False: the rows where "a" is True or False.

When I ```.groupby``` and do a sum, that works:

In [210]:
from numpy import repeat

In [215]:
df = DataFrame({
    "a": repeat([True, False], (size := 8)//2),
    "b": rng.integers(-10, +10, size)
})

In [217]:
print(df, df.groupby("a").sum(), sep="\n\n")

       a  b
0   True  1
1   True -1
2   True -5
3   True -1
4  False -2
5  False -1
6  False  6
7  False  6

       b
a       
False  9
True  -6


But if we want to do another operation, like, say, the kurtosis, that is not built into the ```.groupby```,
we can't do ```.groupby.kurt()```, so we might do a ```.groupby.transform```:

In [218]:
print(df, df.groupby("a").transform(lambda x: x.kurt()), sep="\n\n")

       a  b
0   True  1
1   True -1
2   True -5
3   True -1
4  False -2
5  False -1
6  False  6
7  False  6

          b
0  2.227147
1  2.227147
2  2.227147
3  2.227147
4 -5.737429
5 -5.737429
6 -5.737429
7 -5.737429


This gives me, what, 8 rows? What on earth is going on here?

Maybe I'll do a ```.apply```:

In [220]:
print(df, df.groupby("a").apply(lambda x: x.kurt()), sep="\n\n")

       a  b
0   True  1
1   True -1
2   True -5
3   True -1
4  False -2
5  False -1
6  False  6
7  False  6

         a         b
a                   
False  0.0 -5.737429
True   0.0  2.227147


That gives me "a" and "b" like that? That's bizarre?

And maybe, eventually, we'll end up doing a ```.agg```:

In [222]:
print(df, df.groupby("a").agg(lambda x: x.kurt()), sep="\n\n")

       a  b
0   True  1
1   True -1
2   True -5
3   True -1
4  False -2
5  False -1
6  False  6
7  False  6

              b
a              
False -5.737429
True   2.227147


That seems to sort of give me the right thing. Why on earth did they create ```.transform```, ```.apply``` and ```.agg```? Why not have just one? Why not make this easy for me? What on earth does this even mean?
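One hedged way to tell two of them apart, sketched on a made-up frame with a hypothetical `spread` function: ```.agg``` returns one value per group, while ```.transform``` returns one value per original row, with the group's result repeated and the original row labels kept.

```python
from pandas import DataFrame

# Hypothetical toy frame with two groups of two rows each
frame = DataFrame({"a": [True, True, False, False], "b": [1, 5, 10, 30]})

# Illustrative user-defined reduction: the range of the values
spread = lambda x: x.max() - x.min()

# One value per group, indexed by the group key
per_group = frame.groupby("a")["b"].agg(spread)

# One value per original row, indexed like the original frame
per_row = frame.groupby("a")["b"].transform(spread)

print(per_group)
print(per_row)
```

Because `per_row` keeps the original row labels, it can be assigned straight back as a new column of `frame`, whereas `per_group` is labelled by the group key.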

#### Maybe we use Pandas because it is full of minor conveniences that allow us to eliminate more or less one line of code.

For instance, we can create a Series and then shift it by one value:

In [223]:
s = Series(rng.integers(-10, +10, size=5))

In [224]:
print(s, s.shift(1), sep="\n\n")

0    2
1    4
2    9
3    2
4   -3
dtype: int64

0    NaN
1    2.0
2    4.0
3    9.0
4    2.0
dtype: float64


Or we can shift it and subtract from itself:

In [227]:
print(s, s.diff(1), sep="\n\n")

0    2
1    4
2    9
3    2
4   -3
dtype: int64

0    NaN
1    2.0
2    5.0
3   -7.0
4   -5.0
dtype: float64


Or we could do ```Series.sum```:

In [230]:
print(s, f"{s.sum() = :.2f}", sep="\n\n")

0    2
1    4
2    9
3    2
4   -3
dtype: int64

s.sum() = 14.00


Or ```Series.mean```:

In [232]:
print(s, f"{s.mean() = :.2f}")

0    2
1    4
2    9
3    2
4   -3
dtype: int64 s.mean() = 2.80


Or ```Series.skew```:

In [233]:
print(s, f"{s.skew() = :.2f}")

0    2
1    4
2    9
3    2
4   -3
dtype: int64 s.skew() = 0.23


Or ```Series.kurt```:

In [234]:
print(s, f"{s.kurt() = :.2f}")

0    2
1    4
2    9
3    2
4   -3
dtype: int64 s.kurt() = 1.34


And between you and me, I don't even know what skew or kurtosis means. Sure, kurtosis is like the scaled 4th moment of the distribution? No clue what that actually means. But I do know I could just have imported these from ```scipy.stats``` and used the ```numpy.ndarray```, and I could just have used indexing on it to chop off the first element. If I am going to drop the NaN anyway, what's the difference? I could just have done a subtraction, that seems much clearer.
The other operations, ```sum``` and ```mean```, are already provided by numpy. As for ```skew``` and ```kurtosis```, fine, I get to eliminate one ```scipy.stats``` import, but that doesn't really seem like a very compelling reason to use Pandas:

In [239]:
print(
    (xs := rng.integers(-10, +10, size=5)),
    # each element minus its predecessor: drop the first, subtract all but the last
    f"diff = {xs[1:] - xs[:-1]}",
    f"{xs.sum() = :.2f}",
    f"{xs.mean() = :.2f}",
    sep="\n\n"
)

[ 6  5 -2  7  6]

diff = [-1 -7  9 -1]

xs.sum() = 22.00

xs.mean() = 4.40
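As a small sketch on the values of the Series shown earlier (the variable names are illustrative), the slicing subtraction, ```numpy.diff``` and ```Series.diff``` followed by ```dropna``` all produce the same differences:

```python
import numpy as np
from pandas import Series

# Same values as the Series used in the shift/diff cells above
values = np.array([2, 4, 9, 2, -3])

by_slicing = values[1:] - values[:-1]         # each element minus its predecessor
by_numpy = np.diff(values)                    # numpy's built-in for the same thing
by_pandas = Series(values).diff().dropna()    # float, because a NaN was there first

print(by_slicing, by_numpy, by_pandas.tolist(), sep="\n")
```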


#### Is it because the DataFrame gives us the ability to store multiple one-dimensional data sets, at the cost of about 250K lines worth of code complexity?

All this to have a dictionary of ```numpy.ndarray```s

I can do operations down the columns of a DataFrame, such as a sum, or create a "sums" column, and that would take me as many as 2 or 3 lines of code, maybe a dictionary and a dictionary comprehension, if I wanted to do it in pure numpy. And I guess I can sum across the columns, and that might take me one more line of code. For all of this complexity that I am removing, all I have to pay is 250_000 lines of additional code complexity in one of my dependencies; that is not such a high price to pay:

In [240]:
df = DataFrame({
    "a": Series(rng.integers(-10, +10, size=5)),
    "b": Series(rng.integers(-10, +10, size=5)),
    "c": Series(rng.integers(-10, +10, size=5)),
})

In [241]:
print(
    df,
    df.sum(),
    df.sum(axis="columns"),
    df.groupby("a").sum(),
    sep="\n\n"
)

    a  b  c
0  -8 -9  9
1 -10 -5  4
2  -8  3  2
3  -3  0  8
4  -8 -5  5

a   -37
b   -16
c    28
dtype: int64

0    -8
1   -11
2    -3
3     5
4    -8
dtype: int64

      b   c
a          
-10  -5   4
-8  -11  16
-3    0   8


#### 250_000 lines of code?

**My way:**

In [247]:
import pandas
from pathlib import Path

In [290]:
df = DataFrame({
       "file": [file.as_posix() for file in Path(pandas.__file__).parent.glob("**/*.py")],
})

In [268]:
def lines(file):
    with open(file) as f:
        return len(f.readlines())

In [294]:
df["lines"] = df.apply(lambda row: lines(row["file"]), axis=1)

In [296]:
df["lines"].sum()

586084

**James's way**:

In [297]:
from subprocess import run, check_output
from pathlib import Path
from io import StringIO
from pandas import read_csv, MultiIndex, IndexSlice

In [307]:
%%time
# we take the Pandas source code
d = Path("/tmp/Pandas")
if not d.exists():
    run([*"git clone --depth 1 https://github.com/pandas-dev/pandas".split(), d])

Cloning into '/tmp/Pandas'...


CPU times: user 5.32 ms, sys: 17.5 ms, total: 22.8 ms
Wall time: 17.2 s


In [366]:
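# we take every single source file in the pandas directory
# and put the result in a Pandas Series
# and we find the length of each of the files
# (`wc` is not defined in the cells shown; it is assumed to hold
# the text output of a `wc` line count over those files)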
s = read_csv(
    StringIO(wc),
    delimiter=" ",
    skipinitialspace=True, # because lines start with multiple 
                           # spaces
    #engine="python",
    index_col=[1],
    #squeeze=True,         # parameter not supported
    names="lines path".split()
    # remove every other line, the "total" at the end of `wc`
)[::2].pipe(lambda s: s.set_axis(map(Path, s.index))) 

In [376]:
# how many lines of code are in each of those files
# I don't care about all of those files, some of them are tests
print(
    # That gives me a mask, I'll take it and do a .loc operation
    # to just pick out the files that don't belong in those
    # testing directories
    s.loc[    
        s
        # index this and remove any entry where it's from a
        # _testing or tests directory
        .index.map(lambda p: not p.is_relative_to("pandas/_testing") and not p.is_relative_to("pandas/tests"))
    ]
    # I'll group the result by the parent-most directory and
    # the file's suffix, 
    .pipe(
        lambda s:
            s.groupby([
                s.index.map(lambda p: p.parents[1]),
                s.index.map(lambda p: p.suffix),
            ]).sum()
    )
    # I'll unstack this and fill with zeroes if it doesn't 
    # happen to have anything, so that I can get a sense for
    # what's there; there are quite some C code lines
    # ~80k lines in the core and ~70k lines in the libs
    .unstack(fill_value=0)
    # if I sum this it will give me the sum on a per-file-type basis
    # 
    .sum()
    # ~222K lines of .py code
    # and if I then sum that up, I get the total #lines of code
    .sum()
    # ~280K
)

284165


#### And for what?

All that bunch of code for something that doesn't make any sense and that I am struggling with all the time does not seem like a good bet.

Because here is a data frame that I want to update. I do this all the time to Python lists and dictionaries.

In [385]:
df = DataFrame({
    "a": rng.integers(-10, +10, size=(size := 5)),
    "b": rng.integers(-10, +10, size=size),
    "c": rng.choice([*ascii_lowercase], size=size)
})

In [388]:
df

Unnamed: 0,a,b,c
0,9,2,d
1,1,7,c
2,8,9,r
3,9,-6,b
4,4,1,s


I'll do something very simple: go to that "a" column, go to the value at 0 in that column, and multiply it by 10_000.

In [389]:
df["a"][0] *= 10_000

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["a"][0] *= 10_000


In [390]:
print(
    df,
    sep="\n\n"
)

       a  b  c
0  90000  2  d
1      1  7  c
2      8  9  r
3      9 -6  b
4      4  1  s


It seems to work, my DataFrame has changed, except that I get a ```SettingWithCopyWarning```.
250_000 lines of code, and I still have no idea of what's going on.
This does not seem like a compelling reason to use Pandas.
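For reference, the documentation page linked in the warning points towards ```.loc```. A minimal sketch on a made-up frame named `frame`, where the row label and the column label go into a single indexing step:

```python
from pandas import DataFrame

# Hypothetical toy frame
frame = DataFrame({"a": [9, 1, 8], "b": [2, 7, 9]})

# One indexing step: row label 0, column label "a"
frame.loc[0, "a"] *= 10_000

print(frame)
```

The chained form ```df["a"][0]``` first pulls out a column and then writes into it, and whether that column is a view or a copy is what the warning is about.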

## Why do we even use Pandas?

Our goal here is to try to figure out how do we become a Pandas expert by understanding the core concepts, and I think the core idea is understanding why we use Pandas in the first place.

### We are not here to talk about Numpy

So let's talk about Numpy. Why use Numpy? Why do we even bother with `numpy` in the first place?

Numpy is fast. Python is slow.

Here we have a simple little timer:

In [392]:
from contextlib import contextmanager
from time import perf_counter, sleep

In [393]:
@contextmanager
def timed(msg):
    start = perf_counter()
    try:
        yield
    finally:
        stop = perf_counter()
        print(f"{msg:<24} elapsed \N{mathematical bold capital delta}t: {stop - start:.6f}")

We'll time something: if I sleep for one second, it takes a little bit more than one second:

In [396]:
with timed("sleep one second"):
    sleep(1)

sleep one second         elapsed 𝚫t: 1.001011


You can see it is not a great timer but it's approximately a decent timer.

If I use pure Python, the Python list I am familiar with, and I create two lists of size 100_000:

In [391]:
from random import randint as py_randint
from numpy.random import randint as np_randint
from numpy import dot as np_dot

In [405]:
SIZE = 100_000

with timed("py: create"):
    py_xs = [py_randint(-10, +10) for _ in range(SIZE)]
    py_ys = [py_randint(-10, +10) for _ in range(SIZE)]

py: create               elapsed 𝚫t: 0.104947


It takes about a tenth of a second.

If I do the same thing with numpy, it takes about 13 times less:

In [409]:
with timed("np: create"):
    np_xs = np_randint(-10, +10, size=SIZE)
    np_ys = np_randint(-10, +10, size=SIZE)

np: create               elapsed 𝚫t: 0.007849


That is a benefit, fast code is good code.

If I take these two lists and compute their dot product:

In [410]:
py_dot = lambda xs, ys: sum([x * y for x, y in zip(xs, ys)])

with timed("py: dot"):
    py_dot(py_xs, py_ys)

py: dot                  elapsed 𝚫t: 0.035559


It takes about 0.04 seconds to do the dot product in pure Python.
I can totally understand what is going on here: take each of the values in x, take each of the values in y, pair them up, multiply them, and sum the result. That's it.

But if I do the same thing with numpy, I get an incredible improvement in speed:

In [411]:
with timed("np: dot"):
    np_dot(np_xs, np_ys)

np: dot                  elapsed 𝚫t: 0.000283


That is over 100 times faster. A 100-times speedup in some code? That's definitely worth it.

#### `numpy.ndarray` is an interpreted view of raw memory

One of the core ideas I use to motivate numpy is that it lets us do numerical operations faster because numpy is a restricted computation domain.

Namely, numpy lets us put a manager class around our data that mediates between the Python layer of our code and code implemented in, perhaps, C or Fortran. Behind that implementation barrier we can unbox values, store them contiguously, control memory exactly, and eliminate dynamic dispatch. As a consequence we get speedups of 70, 80, 100 times: a significant increase in the speed of the code.
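
To make "unboxing" and "contiguous" a little more concrete, here is a minimal sketch that compares how a list and an array hold the same three integers. The names `py_xs`, `np_xs` and `boxed_size` are just for illustration, and the size of a boxed Python int is a CPython implementation detail, so the first number may differ on your machine:

```python
import sys
from numpy import array

py_xs = [0, 1, 2]
np_xs = array(py_xs, dtype="int64")

# A list stores pointers to full Python objects ("boxed" ints), each with
# its own header; the exact size is a CPython implementation detail.
boxed_size = sys.getsizeof(py_xs[1])

# An int64 array stores the raw 8-byte values side by side in one block.
print(f"{boxed_size = }")
print(f"{np_xs.itemsize = }")
print(f"{np_xs.nbytes = }")
print(f"{np_xs.flags['C_CONTIGUOUS'] = }")
```

The array needs 8 bytes per value and nothing else per element, whereas every list element is a separate object that has to be looked up and type-checked before any arithmetic can happen.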

The reason to think of numpy as a restricted computation domain is that we have to stay inside that domain. We lose much of that performance when we cross its boundary: if I hand my Python lists to the numpy dot product, numpy first has to convert them into arrays, and the call is several times slower than it was on numpy data, although in this run it is still faster than the pure-Python version:

In [412]:
with timed("np: dot (py data)"):
    np_dot(py_xs, py_ys)

np: dot (py data)        elapsed 𝚫t: 0.001629
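
One way to see where that extra time goes is to time the conversion separately from the computation. This is only a sketch: it assumes that handing a list to `numpy.dot` costs roughly what an explicit `numpy.asarray` call costs, the variable names are made up for the example, and the timings will vary from machine to machine:

```python
from random import randint
from time import perf_counter
from numpy import asarray, dot

SIZE = 100_000
py_xs = [randint(-10, +10) for _ in range(SIZE)]
py_ys = [randint(-10, +10) for _ in range(SIZE)]

# Crossing the boundary: every boxed Python int is unpacked into raw memory.
start = perf_counter()
np_xs, np_ys = asarray(py_xs), asarray(py_ys)
convert = perf_counter() - start

# Staying inside the domain: the arrays are already raw, contiguous memory.
start = perf_counter()
inside = dot(np_xs, np_ys)
compute = perf_counter() - start

print(f"convert lists to arrays: {convert:.6f}")
print(f"dot on arrays:           {compute:.6f}")

# Same answer either way; only the cost differs.
assert inside == dot(py_xs, py_ys) == sum(x * y for x, y in zip(py_xs, py_ys))
```

If the conversion line dominates, as I would expect it to, that is the price of crossing the boundary, and it is paid again on every call that is handed Python lists.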


This restricted computation domain idea is very important, because it gives us the fundamental motivation for staying within numpy, or within pandas: it is a domain that mediates between the pure-Python level and some implementation level. As long as you stay at that implementation level everything is fine, but every time you cross the boundary you pay additional costs, and with enough crossings you can end up worse off than if you had stayed on one side or the other. The domain imposes certain restrictions so that your computations can run faster. And if we think about what numpy really is, it is just an interpreted view of raw memory.

Here we have a ```numpy.ndarray```, and we can dig into it to see that this is actually raw memory at some memory location that we are interpreting in some fashion:

In [413]:
from numpy import array

In [414]:
xs = array([0, 1, 2])
print(
    f"{xs                                = }",
    f"{xs.__array_interface__['data'][0] = :#_x}",
    f"{xs.dtype                          = }",
    f"{xs.shape                          = }",
    f"{xs.strides                        = }",
    sep="\n",
)

xs                                = array([0, 1, 2])
xs.__array_interface__['data'][0] = 0x6000_0307_c800
xs.dtype                          = dtype('int64')
xs.shape                          = (3,)
xs.strides                        = (8,)


It says that some block at this location contains int64 values, and we interpret it as 3 values in a one-dimensional structure.
And we have a mechanism by which we can move through these values in constant time: the strides. All of these pieces fit together when you think about numpy.
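
A small sketch of why "interpreted view" is the right way to put it: the same block of memory can be read under two different shapes and strides without copying anything. The 2×3 array here is just an illustrative choice:

```python
from numpy import arange, shares_memory

xs = arange(6, dtype="int64").reshape(2, 3)
ys = xs.T  # transpose: a new view onto the same memory, not a copy

# Strides say how many bytes to step to reach the next item along each axis.
print(f"{xs.shape = } {xs.strides = }")
print(f"{ys.shape = } {ys.strides = }")
print(f"{shares_memory(xs, ys) = }")
```

For `xs` the strides are `(24, 8)`: step 24 bytes to reach the next row and 8 bytes to reach the next column. The transpose just swaps them to `(8, 24)`. No values move, only the interpretation changes.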

#### numpy provides us with a "mathematical" type (what we would call a "vector" or "matrix" or "tensor") with corresponding operations

It also provides us with something that is missing from Python.

In other words, a Python list is some sort of opaque collection of items. If we add two lists together, then, because they are opaque collections of items (stuff we iterate over), the result is just the two concatenated. If we multiply a list, it's just a bag of stuff, so we are saying: give me that bag of stuff 3 times over, repeat this thing:

In [415]:
xs = [1, 2, 3]
ys = [4, 5, 6]

print(f"{xs + ys = }")
print(f"{xs *  3 = }")

xs + ys = [1, 2, 3, 4, 5, 6]
xs *  3 = [1, 2, 3, 1, 2, 3, 1, 2, 3]


Whereas if we do the same on a ```numpy.ndarray```, we get what we expect from mathematical operations: vector addition and scalar multiplication. The array is not opaque. We know that what is inside is numeric values, so adding two arrays matches the values up and adds them pairwise, and multiplying by 3 multiplies each value element-wise:

In [416]:
xs = array([1, 2, 3])
ys = array([4, 5, 6])

print(f"{xs + ys = }")
print(f"{xs *  3 = }")

xs + ys = array([5, 7, 9])
xs *  3 = array([3, 6, 9])


That kind of makes sense.
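
For comparison, here is a minimal sketch of what we would have to write to get the same "mathematical" behaviour out of plain lists. The names `py_sum` and `py_scaled` are only for this example:

```python
from numpy import array

py_xs, py_ys = [1, 2, 3], [4, 5, 6]
np_xs, np_ys = array(py_xs), array(py_ys)

# With lists, the element-wise pairing has to be written out by hand.
py_sum = [x + y for x, y in zip(py_xs, py_ys)]
py_scaled = [x * 3 for x in py_xs]

# With arrays, the operators already mean "element-wise".
assert py_sum == (np_xs + np_ys).tolist()
assert py_scaled == (np_xs * 3).tolist()

print(f"{py_sum = }")
print(f"{py_scaled = }")
```

The list versions work, but the meaning lives in the loop we wrote, not in the type. With the array, `+` and `*` carry that meaning themselves.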

#### ```numpy.ndarray``` provides us with a fixed-size, dynamic-shape, higher-dimensional structure

It is important that we talk about dimensionality, because it is related to one of the core concepts we have to understand about Pandas: Pandas is one-dimensional data. Even though the documentation says it is two-dimensional data, it is really more like aligned one-dimensional data.

Let's think about what that means.