# So you want to be a Pandas expert

In [1]:
#!pip install scipy

The core concept of Pandas is the index, and index alignment.

## Why use Pandas?

Let's start with the question: why Pandas in the first place?
Why do we even bother with Pandas?
Because, in my opinion, Pandas is a very one-dimensional tool: there is basically only one reason to use it (and there is a pun in there too).
If you think of all the different reasons you might have landed on learning Pandas, a lot of them might come up, but they are not very good reasons.

You might want to use Pandas because you like the Python ```dict``` and ```list```, and you want something even less convenient and even more perplexing.

### Pandas Series

For instance, here is this thing called a Pandas Series.

In [2]:
from pandas import Series, DataFrame
from numpy.random import default_rng

rng = default_rng(0)

s = Series(rng.integers(-10, +10, size=5))

print(s)

0    7
1    2
2    0
3   -5
4   -4
dtype: int64


It prints a little bit differently than a list, and we can iterate over its contents using ```items```:

In [3]:
for x in s.items(): print(f"{x=}")

x=(0, 7)
x=(1, 2)
x=(2, 0)
x=(3, -5)
x=(4, -4)


We get the numbering of the rows. I don't know why that's useful, so maybe we'll just unpack it into a variable that we don't use:

In [4]:
for _, x in s.items(): print(f"{x=}")

x=7
x=2
x=0
x=-5
x=-4


or we can iterate over this directly, but I can't see what this is giving us over a list:

In [5]:
for x in s: print(f"{x=}")

x=7
x=2
x=0
x=-5
x=-4


Or, more likely, if we are using Pandas, we are more familiar with the Pandas DataFrame:

In [6]:
df = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])

In [7]:
df

    a  b    c
0 -10 -9 -10
1  -7  6   2
2   8  0   2
3   9  4   2
4   0  1   8


This looks kind of like a dictionary of Python lists, or maybe a bunch of lists stacked next to each other, or maybe some sort of matrix.

And, again, we can iterate over this thing, but then we get something really weird:

In [8]:
for x in df: print(f"{x=}")

x='a'
x='b'
x='c'


We get the column names. How is that useful to anybody?

So I guess we could try ```iterrows```:

In [9]:
for x in df.iterrows(): print(f"{x=}")

x=(0, a   -10
b    -9
c   -10
Name: 0, dtype: int64)
x=(1, a   -7
b    6
c    2
Name: 1, dtype: int64)
x=(2, a    8
b    0
c    2
Name: 2, dtype: int64)
x=(3, a    9
b    4
c    2
Name: 3, dtype: int64)
x=(4, a    0
b    1
c    8
Name: 4, dtype: int64)


But I have no idea what this is giving me: some row label and each of the columns.

Or ```items```:

In [10]:
for x in df.items(): print(f"{x=}")

x=('a', 0   -10
1    -7
2     8
3     9
4     0
Name: a, dtype: int64)
x=('b', 0   -9
1    6
2    0
3    4
4    1
Name: b, dtype: int64)
x=('c', 0   -10
1     2
2     2
3     2
4     8
Name: c, dtype: int64)


That is kind of like a dictionary, right? This gives me like each of the columns independently. I guess I can throw away the column name and get the column:

In [11]:
for _, x in df.items(): print(f"{x=}")

x=0   -10
1    -7
2     8
3     9
4     0
Name: a, dtype: int64
x=0   -9
1    6
2    0
3    4
4    1
Name: b, dtype: int64
x=0   -10
1     2
2     2
3     2
4     8
Name: c, dtype: int64


I don't know what that is.

Or I can do ```itertuples```:

In [12]:
for x in df.itertuples(): print(f"{x=}")

x=Pandas(Index=0, a=-10, b=-9, c=-10)
x=Pandas(Index=1, a=-7, b=6, c=2)
x=Pandas(Index=2, a=8, b=0, c=2)
x=Pandas(Index=3, a=9, b=4, c=2)
x=Pandas(Index=4, a=0, b=1, c=8)


There I can start getting close to something I can deal with: tuples, lists, dictionaries, I understand those things.

Because, if you compare a Pandas DataFrame or Series to the built-in datatypes in Python, the Python list, the Python set, the Python dictionary, I'd say, in terms of straightforwardness, the built-in types really win out.

Here we have a Python list, and it's just a bunch of numbers:

In [13]:
from random import seed; seed(0)
from random import choice, randint
from string import ascii_lowercase

In [14]:
xs = [randint(-10, +10) for _ in range(10)]

In [15]:
xs

[2, 3, -9, -2, 6, 5, 2, -1, 5, 1]

A structure we iterate over, and we can get the numbers and do something:

In [16]:
for x in xs: print(f"{x=}")

x=2
x=3
x=-9
x=-2
x=6
x=5
x=2
x=-1
x=5
x=1


We also have the dictionary, a set of key-value pairs:

In [17]:
d = {choice(ascii_lowercase): randint(-10, +10) for _ in range(10)}

In [18]:
d

{'s': -4,
 'q': -6,
 'j': -6,
 'y': -7,
 't': -2,
 'r': 9,
 'e': -1,
 'd': -8,
 'v': 0,
 'p': 7}

We can also iterate over the keys:

In [19]:
for k in d: print(f"{k=}")

k='s'
k='q'
k='j'
k='y'
k='t'
k='r'
k='e'
k='d'
k='v'
k='p'


Which we can also do explicitly:

In [20]:
for k in d.keys(): print(f"{k=}")

k='s'
k='q'
k='j'
k='y'
k='t'
k='r'
k='e'
k='d'
k='v'
k='p'


We can also go over the individual values:

In [21]:
for v in d.values(): print(f"{v=}")

v=-4
v=-6
v=-6
v=-7
v=-2
v=9
v=-1
v=-8
v=0
v=7


We can also go over the pairing of the keys and the values; that also seems a straightforward and very simple API:

In [22]:
for k,v in d.items(): print(f"{k, v =}")

k, v =('s', -4)
k, v =('q', -6)
k, v =('j', -6)
k, v =('y', -7)
k, v =('t', -2)
k, v =('r', 9)
k, v =('e', -1)
k, v =('d', -8)
k, v =('v', 0)
k, v =('p', 7)


So if we think of why we might want to use Pandas, it is not because of the strange naming of all these things, ```itertuples```, ```iterrows```. Maybe it's because of the really bizarre errors that come up when you use Pandas.

Here we have a Pandas Series, and it seems to be just some numeric values, and here we have a Pandas DataFrame:

In [23]:
from pandas import MultiIndex

In [24]:
rnd = default_rng(0)
s = Series(rnd.integers(-10, 10, size=5))
df1 = DataFrame(rnd.integers(-10, 10, size=(5,3)))
df2 = DataFrame(
    index=(idx := [1, 1, 2, 3, 4]),
    data=rng.integers(-10, 10, size=(len(idx), 3))
)
df3 = DataFrame(
    index=(idx := MultiIndex.from_product([[0], range(5)])),
    data=rng.integers(-10, 10, size=(len(idx), 3))
)

In [25]:
s

0    7
1    2
2    0
3   -5
4   -4
dtype: int64

In [26]:
df1

    0  1    2
0 -10 -9 -10
1  -7  6   2
2   8  0   2
3   9  4   2
4   0  1   8


We can multiply these, but we get a bunch of NaNs at the end. Now, is that useful?

In [27]:
s * df1

    0   1  2   3   4
0 -70 -18  0 NaN NaN
1 -49  12  0 NaN NaN
2  56   0  0 NaN NaN
3  63   8  0 NaN NaN
4   0   2  0 NaN NaN


Matrix multiplication is not commutative, but here it doesn't make any difference:

In [28]:
df1 * s

    0   1  2   3   4
0 -70 -18  0 NaN NaN
1 -49  12  0 NaN NaN
2  56   0  0 NaN NaN
3  63   8  0 NaN NaN
4   0   2  0 NaN NaN
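Why the order makes no difference and where the NaNs come from can be seen on a smaller case. The sketch below uses hypothetical toy values (not the random data above) and assumes the default behaviour of ```*``` between a Series and a DataFrame: it is element-wise, and the Series index is matched against the DataFrame's column labels.

```python
import pandas as pd

# Hypothetical toy data, not the random values used above.
s = pd.Series([10, 20, 30, 40, 50])          # index labels 0..4
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])    # column labels 0..2

# `*` is element-wise: Series label k is paired with DataFrame column k.
product = s * df
print(product)

# Labels 3 and 4 exist only in the Series, so those columns are all NaN.
nan_columns = [c for c, all_nan in product.isna().all().items() if all_nan]
print(nan_columns)

# Element-wise multiplication is commutative, so the order does not matter.
print((s * df).equals(df * s))
```

The NaN columns are labels that had no partner on the other side, not a failed computation.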


Maybe I'll introduce another DataFrame into the story, and we'll see if anything else happens.

Here we have a different DataFrame, and if I add it to the first DataFrame, I get a row of NaNs at the beginning:

In [29]:
df2

    0   1   2
1  -5   6   3
1 -10  -3   7
2   1 -10   5
3   4   6  -7
4  -9   7 -10


In [30]:
df1 + df2

      0     1    2
0   NaN   NaN  NaN
1 -12.0  12.0  5.0
1 -17.0   3.0  9.0
2   9.0 -10.0  7.0
3  13.0  10.0 -5.0
4  -9.0   8.0 -2.0


That's not a big deal, because I can at least just drop the NaNs. I guess that's what I do all the time with Pandas, just drop NaNs, because they are popping up all over the place:

In [31]:
(df1 + df2).dropna()

      0     1    2
1 -12.0  12.0  5.0
1 -17.0   3.0  9.0
2   9.0 -10.0  7.0
3  13.0  10.0 -5.0
4  -9.0   8.0 -2.0


I can join these, and get a bunch of NaNs, I guess I can just drop them:

In [32]:
df1

    0  1    2
0 -10 -9 -10
1  -7  6   2
2   8  0   2
3   9  4   2
4   0  1   8


In [33]:
df2

    0   1   2
1  -5   6   3
1 -10  -3   7
2   1 -10   5
3   4   6  -7
4  -9   7 -10


In [34]:
df1.join(df2, rsuffix="-df2")

    0  1   2  0-df2  1-df2  2-df2
0 -10 -9 -10    NaN    NaN    NaN
1  -7  6   2   -5.0    6.0    3.0
1  -7  6   2  -10.0   -3.0    7.0
2   8  0   2    1.0  -10.0    5.0
3   9  4   2    4.0    6.0   -7.0
4   0  1   8   -9.0    7.0  -10.0


Let's introduce another DataFrame here, which doesn't look too dissimilar from the second DataFrame we were looking at. When we try to join it, we get an error: cannot join with no overlapping index names:

In [35]:
df3

      0   1  2
0 0   0  -9 -5
  1  -1  -2 -2
  2 -10 -10 -8
  3 -10   3  0
  4   2  -5  2


In [36]:
try:
    df1.join(df3, rsuffix="-df3")
except Exception as ex:
    print(ex)

cannot join with no overlapping index names


Wait, joining should be like stacking them next to each other, kind of like what the plus operation seems to be trying to do, so why is it telling me about non-overlapping index names? What the heck does that even mean? And if we are unlucky enough to ask one of our co-workers what this means and why Pandas is giving us this error, the answer will be: you just have to rename the axis to something on both sides, and then the join will work as you expect:

In [37]:
(
    df1
    .rename_axis("idx")
    .join(
        df3
        .rename_axis([..., "idx"]),
        rsuffix="-df3"
    )
)

               0  1   2  0-df3  1-df3  2-df3
Ellipsis idx
0        0   -10 -9 -10      0     -9     -5
         1    -7  6   2     -1     -2     -2
         2     8  0   2    -10    -10     -8
         3     9  4   2    -10      3      0
         4     0  1   8      2     -5      2


And you say, what does that even mean, how is that helpful? That is total nonsense. Rename the axis?

#### List makes sense

If you compare that to built-in datatypes, the list just makes sense. Here we have a list called xs and a list called ys, and these represent some opaque collection of items that we iterate over.

In [38]:
xs = [randint(-10, +10) for _ in range(5)]
ys = [randint(-10, +10) for _ in range(5)]

If we add them together:

In [39]:
xs + ys

[-7, 1, 3, 0, 9, 10, -4, 7, 5, 4]

This represents a container-level operation: it concatenates them.

If we want to use some special syntax for this, we can unpack these lists into a list literal, concatenating them in a different fashion:

In [40]:
[*xs, *ys]

[-7, 1, 3, 0, 9, 10, -4, 7, 5, 4]

If we want to actually add them up element by element, we line them up: a for loop, just zip them together and add up each pair. This just makes sense; there is no renaming of axes in this:

In [41]:
[x + y for x, y in zip(xs, ys)]

[3, -3, 10, 5, 13]

#### Dictionary makes sense

A dictionary also makes a whole lot of sense.

Here we have two dictionaries with key-value pairs:

In [42]:
d1 = {choice(ascii_lowercase): randint(-10, +10) for _ in range(10)}
d2 = {choice(ascii_lowercase): randint(-10, +10) for _ in range(10)}

We can use the unpacking syntax to merge these dictionaries:

In [43]:
{**d1, **d2}

{'q': 5,
 'b': 7,
 'a': 9,
 'x': 2,
 'w': -7,
 'p': 0,
 'h': -3,
 'g': 8,
 'z': 7,
 'o': -8,
 'c': 0,
 'd': -1,
 'r': 0}

You take all key-value pairs from each dictionary, and there is an order of preference if there happens to be an overlap.

If we happen to use Python 3.9 or later, we can do this with just the pipe syntax:

In [44]:
d1 | d2

{'q': 5,
 'b': 7,
 'a': 9,
 'x': 2,
 'w': -7,
 'p': 0,
 'h': -3,
 'g': 8,
 'z': 7,
 'o': -8,
 'c': 0,
 'd': -1,
 'r': 0}

If we want to do an arithmetic operation, we can just do it: we take the union of the keys of both dictionaries (here by iterating over the merged dictionary ```d1 | d2```), look up the values, and create a new dictionary with the sums. If a key is missing from one of the dictionaries, we substitute 0 by using the dictionary's ```get``` method:

In [45]:
{k: d1.get(k, 0) + d2.get(k, 0) for k in d1 | d2}

{'q': 3,
 'b': 7,
 'a': 9,
 'x': 2,
 'w': -15,
 'p': 0,
 'h': -3,
 'g': 8,
 'z': 7,
 'o': -8,
 'c': 0,
 'd': -1,
 'r': 0}

#### ```collections.Counter``` makes sense

Even if we look in the Python standard library and look at the collection types it provides, they just make sense.

```collections.Counter``` totally makes sense. It is just kind of like a dictionary, except that it specifies that the values have to be some sort of integers or some sort of numeric values, some counts.

In [46]:
from collections import Counter

In [47]:
c1 = Counter({choice(ascii_lowercase): randint(-10, +10) for _ in range(10)})
c2 = Counter({choice(ascii_lowercase): randint(-10, +10) for _ in range(10)})

In [48]:
c1

Counter({'c': 9, 'r': 8, 'k': 8, 'j': 4, 'z': 2, 'h': -1, 'f': -9})

In [49]:
c2

Counter({'g': 8,
         'c': 7,
         'w': 6,
         'i': 6,
         'v': 2,
         't': -2,
         'z': -3,
         'p': -8,
         'e': -9})

We can add them together. If you have 4 "a"s in one and 5 in the other, you have 9 in the result, it makes sense:

In [50]:
c1 + c2

Counter({'c': 16, 'r': 8, 'k': 8, 'g': 8, 'w': 6, 'i': 6, 'j': 4, 'v': 2})

NOTE: ```Counter``` addition keeps only the positive counts.

We might even get the intersection or the union of the counts:
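A small sketch of that behaviour, using hypothetical counts rather than the random ones above: ```+``` drops any total that is zero or negative, while ```Counter.update``` adds in place and keeps them.

```python
from collections import Counter

# Hypothetical counts chosen so that one total is negative and one is zero.
c1 = Counter({"a": 4, "b": -2, "c": 1})
c2 = Counter({"a": 5, "b": 1, "c": -1})

# `+` keeps only strictly positive totals: "b" (-1) and "c" (0) disappear.
added = c1 + c2
print(added)

# update() adds the counts in place and keeps zero and negative totals.
kept = Counter(c1)
kept.update(c2)
print(kept)
```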

In [51]:
c1 & c2

Counter({'c': 7})

In [52]:
c1 | c2

Counter({'c': 9,
         'r': 8,
         'k': 8,
         'g': 8,
         'w': 6,
         'i': 6,
         'j': 4,
         'z': 2,
         'v': 2})

We might scratch our heads a little bit and think this is stretching it; what does that actually mean? But it actually makes sense: if you had two bags of things you have counted, you can ask what is the minimum or the maximum count you can rely on having in either bag. These map to our intersection and union operators.
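The minimum and maximum reading can be checked on two hypothetical bags:

```python
from collections import Counter

# Hypothetical bags of counted items.
bag1 = Counter({"a": 3, "b": 1})
bag2 = Counter({"a": 2, "c": 5})

# Intersection: per-key minimum (keys missing on one side count as 0 and drop out).
in_both = bag1 & bag2
# Union: per-key maximum.
in_either = bag1 | bag2

print(in_both, in_either, sep="\n")
```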

#### Is it because we constantly need to do ```.values``` to force it to do what we want?

That does not quite give us a good motivation as to why we might want to use Pandas. Maybe it's because Pandas is so frustrating that we want to use ```.values``` all over the place just to get back to, let's say, a numpy ```ndarray```, something we know how to deal with.

Here we have a Series with numerical values, and DataFrames:

In [53]:
s = Series(rng.integers(-10, +10, size=5))

In [54]:
s

0    5
1   -3
2   -1
3    9
4    6
dtype: int64

In [55]:
df1 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])
df2 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"def"])

In [56]:
df1

   a  b  c
0  9 -3  3
1  9  3  6
2  3  4 -3
3  7 -8  1
4  4  6  0


In [57]:
df2

   d  e  f
0 -3 -4 -2
1 -1  4  7
2 -9  8  0
3 -3  3  1
4 -5 -4  4


If we multiply these, we get all NaNs:

In [58]:
s * df1

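    a   b   c   0   1   2   3   4
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN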


If I ```dropna``` that is not going to be helpful to me at all!

Let's see what happens if I add the DataFrames together:

In [59]:
df1 + df2

    a   b   c   d   e   f
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN


They seem to be of the same size, but when I add them together I get all NaNs as well.

So let's just throw Pandas away and do ```.values```:

In [60]:
df1.values + df2.values

array([[ 6, -7,  1],
       [ 8,  7, 13],
       [-6, 12, -3],
       [ 4, -5,  2],
       [-1,  2,  4]])

If we do the same thing with the Series and a DataFrame, we get a broadcasting error:

In [61]:
try:
    s.values + df1.values
except Exception as ex:
    print(ex)

operands could not be broadcast together with shapes (5,) (5,3) 


Even in the numpy universe our life isn't that easy. If we ask a helpful coworker, they will say you have to use a new axis: you have to add an axis to satisfy the broadcasting rules:

In [62]:
s.values[:, None] + df1.values

array([[14,  2,  8],
       [ 6,  0,  3],
       [ 2,  3, -4],
       [16,  1, 10],
       [10, 12,  6]])

I would reply that you are speaking a different language and I have no idea what you are talking about.

If you do manage to coerce this to work, and you ```.values``` your way to something that actually gives you the answer you want, but you still want that DataFrame for whatever reason, you can take the result and stick it back into a DataFrame. I guess that's not too bad:
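For what it's worth, the coworker's advice can be read off the shapes. A minimal sketch with hypothetical arrays: numpy compares shapes from the right, so ```(5,)``` against ```(5, 3)``` pairs 5 with 3 and fails, while ```(5, 1)``` stretches its length-1 axis across the 3 columns.

```python
import numpy as np

# Hypothetical stand-ins for s.values and df1.values.
col = np.array([1, 2, 3, 4, 5])        # shape (5,)
mat = np.arange(15).reshape(5, 3)      # shape (5, 3)

# Shapes are compared from the right: 5 vs 3 do not match, so this fails.
try:
    col + mat
except ValueError as ex:
    print(ex)

# Indexing with None adds a length-1 axis: (5, 1) broadcasts against (5, 3).
result = col[:, None] + mat
print(col[:, None].shape, result.shape)
```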

In [63]:
DataFrame(df1.values + df2.values)

   0   1   2
0  6  -7   1
1  8   7  13
2 -6  12  -3
3  4  -5   2
4 -1   2   4


Doesn't seem like a very powerful reason for us to use Pandas.

In [64]:
#### Or perhaps that we sometimes get totally perplexing results and have to ```.reset_index()``` to coerce the library to 

Maybe it is that, in addition to ```.values```, we have to ```.reset_index()``` to kind of stay within Pandas and throw away whatever that index is, because that seems to be the source of our problems. I guess that might be a compelling reason to use Pandas: just ```.reset_index()``` our way to success.

We have ```df1```, we group by that ```a``` column, and then sum:

In [65]:
df1 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])
df2 = DataFrame(rng.integers(-10, +10, size=(5,3)), columns=[*"abc"])

In [66]:
df1

   a  b   c
0  1  0  -4
1  5 -3  -4
2  7 -5  -6
3  4  2 -10
4 -9 -3   6


In [67]:
df2

   a  b  c
0 -2  5 -4
1 -6  5  7
2 -9 -9  3
3 -4  1 -7
4  7 -1  7


In [68]:
df1.groupby("a").sum()

    b   c
a
-9 -3   6
 1  0  -4
 4  2 -10
 5 -3  -4
 7 -5  -6


This doesn't do much, but at least it will sort by the "a" values.

If we take that and we add it to another DataFrame:

In [69]:
df1.groupby("a").sum() + df2

     a    b    c
-9 NaN  NaN  NaN
 0 NaN  NaN  NaN
 1 NaN  5.0  3.0
 2 NaN  NaN  NaN
 3 NaN  NaN  NaN
 4 NaN  1.0 -3.0
 5 NaN  NaN  NaN
 7 NaN  NaN  NaN


We end up again with a bunch of NaN values, except for a tiny little sliver there, but ```.dropna``` is not going to help.

You know what, I don't need to use ```.values```, I just use ```.reset_index``` on one side and get something that looks like what I want (in the ```.groupby``` I might be able to set a parameter to say don't use the index):

In [70]:
df1.groupby("a").sum().reset_index() + df2

    a  b   c
0 -11  2   2
1  -5  5   3
2  -5 -7  -7
3   1 -2 -11
4  14 -6   1
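The parameter hinted at above is, as far as I know, ```as_index=False```. A minimal sketch on hypothetical data, showing that it leaves the group keys as an ordinary column so that no ```.reset_index()``` is needed:

```python
import pandas as pd

# Hypothetical data with unique values in "a", like the example above.
df1 = pd.DataFrame({"a": [3, 1, 2], "b": [10, 20, 30]})
df2 = pd.DataFrame({"a": [1, 1, 1], "b": [1, 1, 1]})

# Two routes to the same result: the keys end up in a regular column "a".
with_reset = df1.groupby("a").sum().reset_index()
without_index = df1.groupby("a", as_index=False).sum()
print(without_index)

# Both sides now share the default 0..n-1 index, so the addition lines up by position.
total = without_index + df2
print(total)
```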


#### How many rows do we get (Good tools should be predictable)

Maybe it's because Pandas is so unpredictable in terms of what it will give us when we perform an operation. A good tool should be guessable: we should be able to write some code and have a sense of what it's going to do before we run that line of code.

That doesn't seem to be the case with Pandas, because even with something simple, like figuring out how many rows I am going to get when doing an operation with two Pandas objects, I can't guess what that is going to be.

Like when I have these two Series, which just contain 4 values, and I sum them up, ok, the result has 4 values, that's not too bad:

In [71]:
s1 = Series(rng.integers(-10, +10, size=4), index=[*"aabb"])
s2 = Series(rng.integers(-10, +10, size=4), index=[*"aabb"])

In [72]:
print(s1, s2, sep="\n\n")

a    5
a    4
b   -6
b    5
dtype: int64

a   -9
a    1
b   -2
b    9
dtype: int64


In [73]:
s1 + s2

a    -4
a     5
b    -8
b    14
dtype: int64

But when I try the same with these 2 other series, which also have 4 values, now I get 6 values. What?

In [74]:
s1 = Series(rng.integers(-10, +10, size=4), index=[*"aaab"])
s2 = Series(rng.integers(-10, +10, size=4), index=[*"abbb"])

In [75]:
print(s1, s2, sep="\n\n")

a   -7
a    8
a   -9
b    2
dtype: int64

a    1
b    7
b   -5
b    8
dtype: int64


In [76]:
s1 + s2

a    -6
a     9
a    -8
b     9
b    -3
b    10
dtype: int64

Again, if we try adding these two other series, which also have 4 values, we get 8 values in the result this time:

In [77]:
s1 = Series(rng.integers(-10, +10, size=4), index=[*"aabb"])
s2 = Series(rng.integers(-10, +10, size=4), index=[*"abbb"])

In [78]:
print(s1, s2, s1 + s2, sep="\n\n")

a    3
a    7
b   -7
b    5
dtype: int64

a     8
b   -10
b    -3
b     2
dtype: int64

a    11
a    15
b   -17
b   -10
b    -5
b    -5
b     2
b     7
dtype: int64


I can't even predict how many values I am going to get when adding these two things. That would never happen with the Python dictionary or list: I can state explicitly how I want to combine the two structures, and I can see immediately from the code, which might be a little bit clumsy, what is going on. This is a pretty terrible reason to use Pandas, because we have very little predictability over what is going on.
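The three results above are at least consistent with one rule. The sketch below assumes (my reading, not something stated in this notebook) that when the two indexes are identical nothing is realigned, and otherwise alignment behaves like an outer join, so a label appearing n1 times on the left and n2 times on the right yields n1 * n2 rows. The data is hypothetical, and the formula only covers labels present on both sides.

```python
import pandas as pd

# Hypothetical values; only the index labels matter for the row count.
s1 = pd.Series([1, 2, 3, 4], index=[*"aabb"])
s2 = pd.Series([10, 20, 30, 40], index=[*"abbb"])

# Count how often each label occurs on each side.
n1 = s1.index.value_counts()
n2 = s2.index.value_counts()

# Assumed rule: each shared label contributes n1 * n2 rows (2*1 + 2*3 here).
predicted = int((n1 * n2).sum())
print(predicted, len(s1 + s2))

# Identical indexes are not realigned, so the length stays at 4.
print(len(s1 + s1))
```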

Maybe what I want to do is ```reset_index``` my way to success here, and then I end up with something that kind of makes sense, except for that column named "index", that is not very useful:

In [79]:
print(s1, s2, s1.reset_index() + s2.reset_index(), sep="\n\n")

a    3
a    7
b   -7
b    5
dtype: int64

a     8
b   -10
b    -3
b     2
dtype: int64

  index   0
0    aa  11
1    ab  -3
2    bb -10
3    bb   7


Or, instead of a plain ```reset_index```, I just throw the index in the garbage with ```drop=True```; it is not helpful at all, not really adding anything to my life:

In [80]:
print(s1, s2, s1.reset_index(drop=True) + s2.reset_index(drop=True), sep="\n\n")

a    3
a    7
b   -7
b    5
dtype: int64

a     8
b   -10
b    -3
b     2
dtype: int64

0    11
1    -3
2   -10
3     7
dtype: int64


#### Is it because the API makes weird, incomprehensible distinctions, like the ```.groupby``` operations for user defined

Maybe the reason we use Pandas is because it makes weird, incomprehensible distinctions in its library.

For example, say we have a DataFrame and we want to ```.groupby``` a column that no longer holds unique values but just two values, True or False: the rows where "a" is True and the rows where it is False.

When I ```.groupby``` and do a sum, that works:

In [81]:
from numpy import repeat

In [82]:
df = DataFrame({
    "a": repeat([True, False], (size := 8)//2),
    "b": rng.integers(-10, +10, size)
})

In [83]:
print(df, df.groupby("a").sum(), sep="\n\n")

       a  b
0   True -8
1   True  0
2   True  2
3   True  5
4  False  8
5  False -2
6  False -2
7  False -1

       b
a       
False  3
True  -1


But if we want to do another operation, like, say, the kurtosis, that is not built into the ```.groupby```, we can't do ```.groupby.kurt()```, so we might do a ```.groupby.transform```:

In [84]:
print(df, df.groupby("a").transform(lambda x: x.kurt()), sep="\n\n")

       a  b
0   True -8
1   True  0
2   True  2
3   True  5
4  False  8
5  False -2
6  False -2
7  False -1

          b
0  1.819254
1  1.819254
2  1.819254
3  1.819254
4  3.800222
5  3.800222
6  3.800222
7  3.800222


This gives me, what, 8 rows? What on earth is going on here?

Maybe I'll do a ```.apply```:

In [85]:
print(df, df.groupby("a").apply(lambda x: x.kurt()), sep="\n\n")

       a  b
0   True -8
1   True  0
2   True  2
3   True  5
4  False  8
5  False -2
6  False -2
7  False -1

         a         b
a                   
False  0.0  3.800222
True   0.0  1.819254


That gives me "a" and "b" like that? That's bizarre?

And maybe, eventually, we'll end up doing a ```.agg```:

In [86]:
print(df, df.groupby("a").agg(lambda x: x.kurt()), sep="\n\n")

       a  b
0   True -8
1   True  0
2   True  2
3   True  5
4  False  8
5  False -2
6  False -2
7  False -1

              b
a              
False  3.800222
True   1.819254


That seems to sort of give me the right thing. Why on earth did they create ```.transform```, ```.apply``` and ```.agg```? Why not have just one? Why not make this easy for me? What on earth does this even mean?
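The outputs above already hint at the distinction: ```.agg``` returned one row per group, ```.transform``` one row per input row. A minimal sketch on hypothetical data, using the mean instead of the kurtosis to keep the numbers easy to check:

```python
import pandas as pd

# Hypothetical data: two groups of two rows each.
df = pd.DataFrame({"a": [True, True, False, False], "b": [1, 3, 10, 30]})
grouped = df.groupby("a")["b"]

# agg: reduces each group to a single value, so one row per group.
per_group = grouped.agg("mean")

# transform: same length as the input, each row gets its own group's result.
per_row = grouped.transform("mean")

print(per_group, per_row, sep="\n\n")

# Because per_row lines up with df, it can be used directly, e.g. to de-mean.
demeaned = df["b"] - per_row
print(demeaned)
```

```.apply``` is the general fallback that hands each whole group to the function, which is presumably why it also reported a value for column "a" above.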

#### Maybe we use Pandas because it is full of minor conveniences that allow us to eliminate more or less one line of code.

For instance, we can create a Series and then shift it by one value:

In [87]:
s = Series(rng.integers(-10, +10, size=5))

In [88]:
print(s, s.shift(1), sep="\n\n")

0     9
1    -7
2    -1
3   -10
4    -2
dtype: int64

0     NaN
1     9.0
2    -7.0
3    -1.0
4   -10.0
dtype: float64


Or we can shift it and subtract from itself:

In [89]:
print(s, s.diff(1), sep="\n\n")

0     9
1    -7
2    -1
3   -10
4    -2
dtype: int64

0     NaN
1   -16.0
2     6.0
3    -9.0
4     8.0
dtype: float64


Or we could do ```Series.sum```:

In [90]:
print(s, f"{s.sum() = :.2f}", sep="\n\n")

0     9
1    -7
2    -1
3   -10
4    -2
dtype: int64

s.sum() = -11.00


Or ```Series.mean```:

In [91]:
print(s, f"{s.mean() = :.2f}")

0     9
1    -7
2    -1
3   -10
4    -2
dtype: int64 s.mean() = -2.20


Or ```Series.skew```:

In [92]:
print(s, f"{s.skew() = :.2f}")

0     9
1    -7
2    -1
3   -10
4    -2
dtype: int64 s.skew() = 0.89


Or ```Series.kurt```:

In [93]:
print(s, f"{s.kurt() = :.2f}")

0     9
1    -7
2    -1
3   -10
4    -2
dtype: int64 s.kurt() = 0.99


And between you and me, I don't even know what skew or kurtosis means. Sure, kurtosis is like the scaled 4th moment of the distribution? No clue what that actually means. But I do know I could just have imported these from ```scipy.stats``` and used a ```numpy.ndarray```, and I could just have used indexing on it to chop off the first element. If I am going to drop the NaN anyway, what's the difference? I could just have done a subtraction; that seems much clearer.
The other operations, ```sum``` and ```mean```, are already provided by numpy, and for ```skew``` and ```kurtosis```, fine, I get to eliminate one ```scipy.stats``` import. It doesn't really seem like a very compelling reason to use Pandas:

In [94]:
print(
    (xs := rng.integers(-10, +10, size=5)), 
    # first difference: each element minus its predecessor
    f"diff = {xs[1:] - xs[:-1]}",
    f"{xs.sum() = :.2f}",
    f"{xs.mean() = :.2f}",
    sep="\n\n"
)

[ 8  2 -4  9  2]

diff = [-6 -6 13 -7]

xs.sum() = 17.00

xs.mean() = 3.40
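A first difference in plain numpy pairs each element with its predecessor, so the two slices are ```xs[1:]``` and ```xs[:-1]```. As a quick sanity check on the values printed above, ```numpy.diff``` gives the same result as the explicit subtraction:

```python
import numpy as np

xs = np.array([8, 2, -4, 9, 2])   # the values printed above

# Explicit version: drop the first element on one side, the last on the other.
by_slicing = xs[1:] - xs[:-1]

# numpy ships the same operation as a function.
by_diff = np.diff(xs)

print(by_slicing, by_diff, sep="\n")
```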


#### Is it because the DataFrame gives us the ability to store multiple one-dimensional data sets, at the cost of about 250K lines worth of code complexity?

All this to have a dictionary of ```numpy.ndarray```s

I can do operations down DataFrames, such as a sum, or create a "sums" column, and it would take me as many as 2 or 3 lines of code, maybe a dictionary and a dictionary comprehension, if I wanted to do this in pure numpy. I guess I can also sum across the columns, and that might take me one more line of code. And for all of this complexity that I am removing, all I have to pay is 250_000 lines of additional code complexity in one of my dependencies. That is not such a high price to pay:

In [95]:
df = DataFrame({
    "a": Series(rng.integers(-10, +10, size=5)),
    "b": Series(rng.integers(-10, +10, size=5)),
    "c": Series(rng.integers(-10, +10, size=5)),
})

In [96]:
print(
    df,
    df.sum(),
    df.sum(axis="columns"),
    df.groupby("a").sum(),
    sep="\n\n"
)

    a  b  c
0   8 -2  5
1 -10 -1 -9
2  -1 -2 -2
3   6  0 -5
4   5 -6  4

a     8
b   -11
c    -7
dtype: int64

0    11
1   -20
2    -5
3     1
4     3
dtype: int64

     b  c
a        
-10 -1 -9
-1  -2 -2
 5  -6  4
 6   0 -5
 8  -2  5


#### 250_000 lines of code?

**My way:**

In [97]:
import pandas
from pathlib import Path

In [98]:
df = DataFrame({
       "file": [file.as_posix() for file in Path(pandas.__file__).parent.glob("**/*.py")],
})

In [99]:
def lines(file):
    with open(file) as f:
        return len(f.readlines())

In [100]:
df["lines"] = df.apply(lambda row: lines(row["file"]), axis=1)

In [101]:
df["lines"].sum()

587437

**James's way**:

In [102]:
from subprocess import run, check_output
from pathlib import Path
from io import StringIO
from pandas import read_csv, MultiIndex, IndexSlice

In [103]:
%%time
# we take the Pandas source code
d = Path("/tmp/Pandas")
if not d.exists():
    run([*"git clone --depth 1 https://github.com/pandas-dev/pandas".split(), d])

CPU times: user 1.24 ms, sys: 1.16 ms, total: 2.39 ms
Wall time: 1.42 ms


In [104]:
# we take every single source file in the pandas directory
# and put the result in a Pandas Series
# and we find the length of each of the files
s = read_csv(
    StringIO(
        check_output("find pandas -type f -iname *.* -exec wc -l {} ;".split(), cwd=d).decode()
    ),
    delimiter=" ",
    skipinitialspace=True, # because lines start with multiple 
                           # spaces
    #engine="python",
    index_col=[1],
    #squeeze=True,         # parameter not supported
    names="lines path".split()
    # remove every other line, the "total" at the end of `wc`
)[::2].pipe(lambda s: s.set_axis(map(Path, s.index))) 

In [105]:
# how many lines of code are in each of those files
# I don't care about all of those files, some of them are tests
print(
    # That gives me a mask, I'll take it and do a .loc operation
    # to just pick out the files that don't belong in those
    # testing directories
    s.loc[    
        s
        # index this and remove any entry where it's from a
        # _testing or tests directory
        .index.map(lambda p: not p.is_relative_to("pandas/_testing") and not p.is_relative_to("pandas/tests"))
    ]
    # I'll group the result by the parent-most directory and
    # the file's suffix, 
    .pipe(
        lambda s:
            s.groupby([
                s.index.map(lambda p: p.parents[1]),
                s.index.map(lambda p: p.suffix),
            ]).sum()
    )
    # I'll unstack this and fill with zeroes if it doesn't 
    # happen to have anything, so that I can get a sense for
    # what's there; there are quite some C code lines
    # ~80k lines in the core and ~70k lines in the libs
    .unstack(fill_value=0)
    # if I sum this it will give me the sum on a per-file-type basis
    # 
    .sum()
    # ~222K lines of .py code
    # and if I then sum that up, I get the total #lines of code
    .sum()
    # ~280K
)

0.0


#### And for what?

All that bunch of code for something that doesn't make any sense and that I am struggling with all the time does not seem like a good bet.

Because here is a data frame that I want to update. I do this all the time to Python lists and dictionaries.

In [106]:
df = DataFrame({
    "a": rng.integers(-10, +10, size=(size := 5)),
    "b": rng.integers(-10, +10, size=size),
    "c": rng.choice([*ascii_lowercase], size=size)
})

In [107]:
df

   a  b  c
0  4 -8  r
1  4 -8  z
2  8  4  w
3  8  9  a
4 -7  8  d


I'll do something very simple: go to that "a" column, go to the value at index 0 in that column, and multiply it by 10_000.

In [108]:
df["a"][0] *= 10_000

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["a"][0] *= 10_000


In [109]:
print(
    df,
    sep="\n\n"
)

       a  b  c
0  40000 -8  r
1      4 -8  z
2      8  4  w
3      8  9  a
4     -7  8  d


It seems to work, the value is updated, except that I also get a ```SettingWithCopyWarning```.
250_000 lines of code, and I still have no idea of what's going on.
This does not seem like a compelling reason to use Pandas.
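For reference, the warning is about chained indexing: ```df["a"][0]``` is two separate indexing steps, and Pandas cannot promise that the second one writes to the original DataFrame. A minimal sketch on hypothetical data of the single-step form that the linked documentation page recommends:

```python
import pandas as pd

# Hypothetical data, a cut-down version of the DataFrame above.
df = pd.DataFrame({"a": [4, 4, 8], "b": [-8, -8, 4]})

# One indexing operation on the DataFrame itself: row label 0, column "a".
df.loc[0, "a"] *= 10_000

print(df)
```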

## Why do we even use Pandas?

Our goal here is to figure out how we become Pandas experts by understanding the core concepts, and I think the core idea is understanding why we use Pandas in the first place.

### We are not here to talk about Numpy

So let's talk about Numpy. Why use Numpy? Why do we even bother with `numpy` in the first place?

Numpy is fast. Python is slow.

Here we have a simple little timer:

In [110]:
from contextlib import contextmanager
from time import perf_counter, sleep

In [111]:
@contextmanager
def timed(msg):
    start = perf_counter()
    try:
        yield
    finally:
        stop = perf_counter()
        print(f"{msg:<24} elapsed \N{mathematical bold capital delta}t: {stop - start:.6f}")

We'll time something: if I sleep for one second, it takes a little bit more than one second:

In [112]:
with timed("sleep one second"):
    sleep(1)

sleep one second         elapsed 𝚫t: 1.003957


You can see it is not a great timer but it's approximately a decent timer.

If I use pure Python, the Python list I am familiar with, and I create two lists of size 100_000:

In [113]:
from random import randint as py_randint
from numpy.random import randint as np_randint
from numpy import dot as np_dot

In [114]:
SIZE = 100_000

with timed("py: create"):
    py_xs = [py_randint(-10, +10) for _ in range(SIZE)]
    py_ys = [py_randint(-10, +10) for _ in range(SIZE)]

py: create               elapsed 𝚫t: 0.087105


It takes about 0.09 seconds.

If I do the same thing with numpy, it takes roughly 13 times less:

In [115]:
with timed("np: create"):
    np_xs = np_randint(-10, +10, size=SIZE)
    np_ys = np_randint(-10, +10, size=SIZE)

np: create               elapsed 𝚫t: 0.006808


That is a benefit, fast code is good code.

If I take these operations and I compute their dot product:

In [116]:
py_dot = lambda xs, ys: sum([x * y for x, y in zip(xs, ys)])

with timed("py: dot"):
    py_dot(py_xs, py_ys)

py: dot                  elapsed 𝚫t: 0.019891


It takes about 0.02 seconds to do the dot product in pure Python.
I can totally understand what is going on here, take each of the values in x, take each of the values in y, pair them up, multiply them and sum the result, that's it.

But if I do the same thing with numpy I get an incredible improvement in speed:

In [117]:
with timed("np: dot"):
    np_dot(np_xs, np_ys)

np: dot                  elapsed 𝚫t: 0.000379


That is about 50 times faster. A 50 times speedup in some code? That's definitely worth it.

#### `numpy.ndarray` is an interpreted view of raw memory

One of the core ideas I have when I use numpy and I try to motivate why to use numpy is that it provides us with a way to do numerical operations faster because numpy is a restricted computation domain.

Namely, it is the ability for us to put a manager class around some sort of Python data to intermediate between the Python layer of our code and some code that is implemented in perhaps C or Fortran. Because we have that implementation barrier, we can do things like unbox values, make them contiguous, control memory exactly, and eliminate dynamic dispatch, and as a consequence we get speedups of tens of times or more, a significant increase in the speed of the code.

The reason we should think of numpy as a restricted computation domain is that we have to stay inside that domain. We lose all that performance if we cross the boundary: if we take the numpy dot product and apply it to our Python lists, it is slower than if we had done it all in pure Python:

In [118]:
with timed("np: dot (py data)"):
    np_dot(py_xs, py_ys)

np: dot (py data)        elapsed 𝚫t: 0.034956
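
One way to see where that cost comes from is to make the conversion explicit. This is a small sketch with made-up four-element lists (not the talk's data): all three routes compute the same number, but the mixed one has to convert the lists every time it is called.

```python
import numpy as np

py_xs = [1, 2, 3, 4]
py_ys = [5, 6, 7, 8]

# Crossing the boundary: numpy has to walk the Python lists and
# convert every boxed int into a raw machine integer before it can work.
np_xs = np.asarray(py_xs)
np_ys = np.asarray(py_ys)

# Each side of the boundary computes the same number...
pure_python = sum(x * y for x, y in zip(py_xs, py_ys))
pure_numpy = np.dot(np_xs, np_ys)
# ...and so does the mixed call, which converts the lists on every call.
mixed = np.dot(py_xs, py_ys)

print(pure_python, pure_numpy, mixed)
```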


This restricted computation domain idea is very important, because it gives us that fundamental motivation for why do we stay within numpy, why do we stay within pandas, because it's a domain which has intermediated between the pure python level and some implementation level. As long as you stay in that implementation level everything is fine, but if you cross that boundary you are creating a number of additional costs that are going to be worse than if you stayed in one side or the other. It's a domain which gives you certain restrictions to allow you to do computations faster. And, if we think of what numpy really is, it is just an interpreted view of raw memory.

Here we have a ```numpy.ndarray```, and we can dig in it to see that this is actually raw memory at that memory location that we are interpreting in some fashion:

In [119]:
from numpy import array

In [120]:
xs = array([0, 1, 2])
print(
    f"{xs                                = }",
    f"{xs.__array_interface__['data'][0] = :#_x}",
    f"{xs.dtype                          = }",
    f"{xs.shape                          = }",
    f"{xs.strides                        = }",
    sep="\n",
)

xs                                = array([0, 1, 2])
xs.__array_interface__['data'][0] = 0x6000_03b5_4200
xs.dtype                          = dtype('int64')
xs.shape                          = (3,)
xs.strides                        = (8,)


It says that some block at this location contains int64 values, and we interpret it as 3 values in a one-dimensional structure.
And we have some mechanism by which we can move through these values in constant time, some striding mechanism. All of these pieces fit together when you think about numpy.
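
To make the striding mechanism a bit more concrete, here is a minimal sketch on a two-dimensional array. The array and the names `grid`, `i`, `j` are mine, and the dtype is pinned to `int64` so the byte counts are the same on every platform.

```python
import numpy as np

# dtype is pinned so the byte counts below do not depend on the platform.
grid = np.arange(6, dtype=np.int64).reshape(2, 3)

print(grid.strides)    # bytes to step to the next row, next column
print(grid.T.strides)  # the transpose only swaps the strides

# The byte offset of element [i, j] is i * strides[0] + j * strides[1].
i, j = 1, 2
offset = i * grid.strides[0] + j * grid.strides[1]
print(offset // grid.itemsize, grid[i, j])

# No data was copied: both views interpret the same block of memory.
print(np.shares_memory(grid, grid.T))
```

Finding an element is one multiplication and one addition per dimension, which is why the lookup is constant time.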

#### numpy provides us with a "mathematical" type (what we would call a "vector" or "matrix" or "tensor") with corresponding operations

It also provides us with something that is missing from Python.

In other words, a Python list is some sort of opaque collection of items. If we add two lists together, because they are opaque collections of items, stuff we iterate over, the addition just concatenates that stuff. If we multiply a list by 3, it is just a bag of stuff, and we are saying: give me that bag of stuff 3 times over, repeat this thing:

In [121]:
xs = [1, 2, 3]
ys = [4, 5, 6]

print(f"{xs + ys = }")
print(f"{xs *  3 = }")

xs + ys = [1, 2, 3, 4, 5, 6]
xs *  3 = [1, 2, 3, 1, 2, 3, 1, 2, 3]


Whereas, if we do the same on a ```numpy.ndarray```, we get what we expect from mathematical operations: vector addition and scalar multiplication. This is not opaque; we know that what is inside are numeric values, so adding two arrays matches the values up and adds them together, and multiplying by 3 multiplies each value element-wise:

In [122]:
xs = array([1, 2, 3])
ys = array([4, 5, 6])

print(f"{xs + ys = }")
print(f"{xs *  3 = }")

xs + ys = array([5, 7, 9])
xs *  3 = array([3, 6, 9])


That kind of makes sense.

#### ```numpy.ndarray``` provides us with a fixed-size, dynamic-shape, higher-dimensional structure

It is important that we talk about dimensionality, because this is related to one of the core concepts that we have to understand about Pandas, namely, Pandas is one-dimensional data, even though the documentation says it's two-dimensional data, it is really like aligned one-dimensional data.

Let's think about what that means.

Here we have a Python list:

In [123]:
xs = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]

Many people would think that here we have a two-dimensional list, and the reason they say it's a two-dimensional list is that if I want to access any individual element I need two coordinates, and so, you can say, needing two coordinates to access something makes it two-dimensional:

In [124]:
print(f"{xs[0][1]               = }")

xs[0][1]               = 1


But, in fact, this is not really two-dimensional, while I can access just one row:

In [125]:
print(f"{xs[0]                  = }")

xs[0]                  = [0, 1, 2]


I can't access one column natively without performing some sort of looping, there is no native first-class way of accessing one column:

In [126]:
print(f"{[row[1] for row in xs] = }")

[row[1] for row in xs] = [1, 4, 7]


So yes, there are two coordinates for every value, but it is not a two-dimensional list, it is a **nested** list. It is not that I am providing two coordinates at once: I provide one coordinate, which gives me a nested structure, and then I provide another coordinate. This is two operations, not one, and so each layer of that list takes just one coordinate; it is just that this is doing two lookups.

And, additionally, it is a nested structure, so something like this makes total sense in Python:

In [127]:
xs = [[0, 1, 2], [3, 4, 5], [6, 7, [8, 9, [10]]]]  # one row nests deeper than the others

print(f"{xs[0][1] = }")
print(f"{xs[2] = }")
print(f"{xs[2][2][2][0] = }")

xs[0][1] = 1
xs[2] = [6, 7, [8, 9, [10]]]
xs[2][2][2][0] = 10

This is a list that contains lists, and one of those lists contains lists that contain lists. If we think about this, it doesn't have uniform dimensionality. Some of the rows, some of the dimensions, have additional dimensions. That doesn't make sense, because when we think about the dimensionality of a structure, it tends to be a property of the structure as a whole, a scalar property: if it is 3-dimensional, everything can be accessed through 3 coordinates. This is nested data, data where, depending on where you are in the structure, a lookup may require a different number of coordinates than in another layer. So it's not quite a two-dimensional structure; in fact a Python list is never two-dimensional data, it is always one-dimensional data, it just has the ability to sometimes nest data.

NOTE: this structure, a completely valid, reasonable python list, has no mathematical meaning, it is impossible to represent in numpy. What does it even mean mathematically?

In [128]:
xs = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, [8, 9, [10]]],
]

This is a matrix with something sticking out of one edge, nested inwards: it is like a hypercube in one part, but not in the rest. It has no actual mathematical meaning.
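
The idea that dimensionality is a property of the structure as a whole can be sketched in pure Python. `uniform_shape` below is a hypothetical helper written only for illustration (it is not a numpy or pandas function): it returns a shape when every branch of a nested list agrees, and `None` when the nesting is ragged.

```python
def uniform_shape(obj):
    """Return the shape of a nested list, or None if it is ragged.

    Hypothetical helper for illustration only; not a numpy or pandas API.
    """
    if not isinstance(obj, list):
        return ()  # a scalar has no dimensions
    if not obj:
        return (0,)  # an empty list is one-dimensional with length 0
    shapes = {uniform_shape(item) for item in obj}
    if len(shapes) != 1 or None in shapes:
        return None  # children disagree, or a child is itself ragged
    return (len(obj), *shapes.pop())


square = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
ragged = [[0, 1, 2], [3, 4, 5], [6, 7, [8, 9, [10]]]]

print(uniform_shape(square))
print(uniform_shape(ragged))
```

Only the first structure has a single shape that describes all of it, which is the precondition for treating it as a matrix.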

Whereas, if we think of **what a python list gives us that a numpy array doesn't** give us, a python list gives us the ability to update the size:

In [129]:
xs = ys = [1, 2, 3, 4]
xs.append(5)
xs.append(6)

print(
    f"{xs = }",
    f"{ys = }",
    sep="\n"
)

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 2, 3, 4, 5, 6]


It is an indirection between the actual storage and this data, so I can take a list and append items at the end and it just works, and I can have two references to that list and both of them see the change.

If I have a ```numpy.ndarray```, sure, I can do an ```hstack```:

In [130]:
from numpy import hstack

In [131]:
xs = ys = array([0, 1, 2])
xs = hstack([xs, [3]])

print(
    f"{xs                = }",
    f"{ys                = }",
    sep="\n\n"
)

xs                = array([0, 1, 2, 3])

ys                = array([0, 1, 2])


But that ```hstack``` is not updating the ```numpy.ndarray``` in place. A ```numpy.ndarray``` is an interpreted view of memory: a raw block of memory somewhere, with an interpretation layered on top of it. I can't just resize a raw block of memory, because who knows what might sit behind it. All I can do is allocate new memory and copy things over, and, if I have two references, only one of them reflects this amendment.

What we can see from this is that a ```numpy.ndarray``` cannot be resized, but it can be **reshaped**. 

Python `list`:
- fixed "shape" (linear)
- dynamic size

`numpy.ndarray`:
- fixed size
- dynamic shape


If we look at the two ndarrays above, we will see two completely different blocks of memory:

In [132]:
print(
    f"{xs.__array_interface__['data'][0] = :#_x}",
    f"{ys.__array_interface__['data'][0] = :#_x}",
    sep = "\n"
)

xs.__array_interface__['data'][0] = 0x6000_03b5_4260
ys.__array_interface__['data'][0] = 0x6000_03b5_4340
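
The "dynamic shape" half of this can be seen with `reshape`. A minimal sketch, using a fresh array of my own rather than the ones above: the reshaped array is a second interpretation of the same block of memory, so nothing is copied.

```python
import numpy as np

flat = np.arange(6)
grid = flat.reshape(2, 3)  # same six values, new shape

print(grid.shape)
print(np.shares_memory(flat, grid))

grid[0, 0] = 99  # writing through one view...
print(flat[0])   # ...is visible through the other
```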


This gives us a very clear reasoning as to **why we use numpy**:
- `numpy` gives us a fixed size, dynamic shaped structure, which is a nice correspondence to our python `list`, which is fixed shaped, dynamic sized.
- `numpy` is a mathematical type, `list` is a sort of container type.
- `numpy` gives us fast code

But we're not here to talk about `numpy`, we are here to talk about `pandas`, so let's talk about `xarray` instead, because we are definitely not here to talk about `xarray`.

### Why `xarray`

Let's say we have some sort of real two-dimensional data:

In [133]:
from scipy.spatial.transform import Rotation

In [134]:
img = rng.integers(0, 255, size=(3,3), dtype="uint8")

In [135]:
print(img)

[[ 86 131  22]
 [220   8 116]
 [ 20  20  31]]


Image data is definitely two-dimensional, because I can access any one pixel in that image data by two coordinates, and I can rotate it and the data means the same thing. I can look at any column, row, or diagonal slice of this, and that's a meaningful operation, and the whole thing is homogeneous; this is definitely two-dimensional data.

If I put this into an `ndarray` I can access an individual row:

In [136]:
print(img[0])

[ 86 131  22]


In [137]:
print(img[0,:])

[ 86 131  22]


I can access one individual column without needing any additional mechanism:

In [138]:
print(img[:,0])

[ 86 220  20]


You can think that there are many different ways for me to perform these operations, but, ultimately, the `numpy.ndarray` is a way for me to take n-dimensional data, in this case two-dimensional data, and give me the ability to operate on it.

The reason why I might want to introduce `xarray` is that even though rotations of this data don't really change what this data is, as a human being I want to be able to access data as a human being. I want to talk about the x coordinate and the y coordinate, because that is something I want to have in my code, something that people can understand, and say x=3 and y=4, map it to a physical reality of where this data was actually captured.

I'll take my `numpy.ndarray` and lift it up into an `xarray.DataArray`, and I'll **name the dimensions**.

In [139]:
from xarray import DataArray

In [140]:
img = DataArray(
    data=rng.integers(0, 255, size=(3,3), dtype="uint8"),
    dims=[*"xy"]
)

In [141]:
print(img)

<xarray.DataArray (x: 3, y: 3)>
array([[ 66, 214, 205],
       [210, 239, 184],
       [ 10, 244,  94]], dtype=uint8)
Dimensions without coordinates: x, y


Now I have the exact same `numpy` data, but I've said, one of these dimensions is the x dimension, one of these dimensions is the y dimension.

Then I can do things like:

In [142]:
print(
    img,
    img.sel(x=0),
    sep="\n\n"
)

<xarray.DataArray (x: 3, y: 3)>
array([[ 66, 214, 205],
       [210, 239, 184],
       [ 10, 244,  94]], dtype=uint8)
Dimensions without coordinates: x, y

<xarray.DataArray (y: 3)>
array([ 66, 214, 205], dtype=uint8)
Dimensions without coordinates: y


It is incredibly clear what this means.

In [143]:
print(img.sel(x=0, y=1))

<xarray.DataArray ()>
array(214, dtype=uint8)


An array of a single value, if that makes sense.

I can select arbitrary subsets, arbitrary squares or diagonals of this.
`xarray` is quite useful: it gives me the ability to access my data, to look up that data, by means of some sort of human metadata, namely, what the name of the coordinate is.

In [144]:
print(img.sel(x=[0, 1], y=[1, 0]))

<xarray.DataArray (x: 2, y: 2)>
array([[214,  66],
       [239, 210]], dtype=uint8)
Dimensions without coordinates: x, y


When using `numpy` you sometimes have to transpose things, because the way in which the data was read in wasn't quite the way your code is going to operate on it: you have code that hardcodes "look at this axis, and this axis, and this axis", and you need to transpose the data to make that work. With `xarray` you can name these axes, these dimensions, and so any code you write, performance considerations aside, can just refer to those axes by name and work regardless of whether your data has to be transposed or not.

If I happened to have this data and it was captured from something with a coordinate system; physical data that I am capturing on some sort of detector, and that detector has tickmarks, 10, 20, 30 inches, I can add a coordinate system as well.

In [145]:
from numpy import linspace

In [146]:
img = DataArray(
    data=(data := rng.integers(0, 255, size=(3,3), dtype="uint8")),
    dims=[*"xy"],
    coords={
        "x": linspace(10, 20, data.shape[0]),
        "y": linspace(10, 20, data.shape[1])
    }
)

In [147]:
print(img)

<xarray.DataArray (x: 3, y: 3)>
array([[246, 100,  20],
       [ 37,  95,  61],
       [ 80, 131, 177]], dtype=uint8)
Coordinates:
  * x        (x) float64 10.0 15.0 20.0
  * y        (y) float64 10.0 15.0 20.0


That allows me to say, select my data using this coordinate system. Don't select the first data value that was captured, select the data value at 15:

In [148]:
img.sel(x=15, y=15)

That is very useful.

With `xarray` you can also say, find the data at a given coordinate, and if you can't find that exactly, look for the nearest value:

In [149]:
img.sel(x=15, y=13, method="nearest")

I can even say, select the data at x=15 and y from 12 to 18, and if you are missing data in between do a linear interpolation, incredibly useful, because someone looking at that says, 15, I know exactly where I put that marker on my detector when I was capturing this particular image.
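
To show what "linear interpolation between coordinates" means arithmetically, here is a sketch in plain `numpy` rather than `xarray`. The values are copied from the `x=15` row of the image printed above; `numpy.interp` stands in for the idea and is not how `xarray` is being called in this talk.

```python
import numpy as np

y_coords = np.array([10.0, 15.0, 20.0])  # tick marks on the detector
row = np.array([37.0, 95.0, 61.0])       # values measured at those ticks

# y=13 lies 3/5 of the way from the tick at 10 to the tick at 15,
# so the estimate is 37 + 0.6 * (95 - 37).
estimate = np.interp(13.0, y_coords, row)
print(estimate)
```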

It is the case that when we work in `numpy`, because we want to stay in that restricted computation domain, we will sometimes invent axes that do not represent aspects of the data.

Here we have not one file, but multiple files, and we put this into a larger structure with one additional dimension representing the filename we are operating on. One requirement here is that every file has to be exactly the same size, 3x3, and they have to have a shared coordinate system, every file was captured on the same coordinate system from 10 to 20 with some stepping, and I have the filename associated with each of these files:

In [150]:
from pathlib import Path
from string import ascii_lowercase

img = DataArray(
    data=(data := rng.integers(0, 255, size=(10, 3, 3), dtype="uint8")),
    dims=["filename", *"xy"],
    coords={
        "filename": [Path(f"{x}.bmp") for x in ascii_lowercase[:data.shape[0]]],
        "x": linspace(10, 20, data.shape[1]),
        "y": linspace(10, 20, data.shape[2])
    }
)

In [151]:
print(data)

[[[ 99  69   1]
  [ 93  10 208]
  [209 226  52]]

 [[206  65  97]
  [ 85  22 134]
  [209 136 182]]

 [[196  57  18]
  [124 223 121]
  [141  50 171]]

 [[ 82 169 201]
  [123  58  79]
  [ 21 134 227]]

 [[171  10  71]
  [204 232 127]
  [191  34 248]]

 [[119 107 235]
  [ 88 156 144]
  [247  14  28]]

 [[ 32  67 253]
  [ 68  41 108]
  [246 153 246]]

 [[136 138  83]
  [ 54 167 215]
  [ 62  87 112]]

 [[244  16  20]
  [ 37 155  37]
  [ 86 237 208]]

 [[196  39 176]
  [242 231  93]
  [  9  22  89]]]


This allows me to do things like select x=15, y=15, from the file named "a.bmp":

In [152]:
print(img.sel(x=15, y=15, filename=Path("a.bmp")))

<xarray.DataArray ()>
array(10, dtype=uint8)
Coordinates:
    filename  object a.bmp
    x         float64 15.0
    y         float64 15.0


Incredibly useful, I can actually understand what the heck this is doing.

`xarray` is a very useful tool if I have raw mathematical data, and I want to interact with that mathematical data with some additional metadata that makes it human understandable, namely, the dimensions and the coordinate system on those dimensions.

But we are not here to talk about `xarray`, we are here to talk about Pandas.

### So why Pandas?

Here I have a `pandas.Series`:

In [153]:
s = Series(rng.integers(-10, +10, size=5))

If I look deeper into a `pandas.Series`, I will see that under the covers it has a `NumpyExtensionArray` (named `PandasArray` in older versions of pandas):

In [154]:
print(
    s,
    s.array,
    sep="\n\n"
)

0     4
1    -7
2     2
3     0
4   -10
dtype: int64

<NumpyExtensionArray>
[4, -7, 2, 0, -10]
Length: 5, dtype: int64


This array is a small indirection from the values that are stored.

If I go a little deeper in this I can probably even pull out the `numpy` values stored within:

In [155]:
print(s.array._ndarray)

[  4  -7   2   0 -10]


This level of indirection, this **extension array**, gives the ability to mask integer values. Obviously, if I have a floating point value, I can store NaN to represent data that is missing, but I can't quite do that with integers because there is no unambiguous in-band encoding for a missing value, so I need an out-of-band encoding, some sort of masking mechanism.
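
A small sketch of that difference, using pandas' nullable `"Int64"` dtype (note the capital I) as one example of a masked extension array; the data is made up.

```python
from pandas import Series

# Default: a missing value forces a fall back to float64 with NaN.
as_float = Series([1, None, 3])
# Nullable integer extension dtype: the values stay integers,
# and the missing entry is tracked out of band.
as_masked = Series([1, None, 3], dtype="Int64")

print(as_float.dtype, as_masked.dtype)
print(as_masked.isna().tolist())
print(as_masked.sum())
```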

The thing that we call **the index**, which is probably the most interesting thing about Pandas, is just like that coordinate system in `xarray`, it is a way for us to refer to the value.

A very common way of indexing that we might use is **datetime indexes**:

In [156]:
from pandas import date_range

In [157]:
idx = date_range("2000-01-01", periods=5, name="date")
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [158]:
print(s)

date
2000-01-01     8
2000-01-02     4
2000-01-03    -4
2000-01-04   -10
2000-01-05    -9
Freq: D, dtype: int64


This isn't just the values 8, 4, -4, -10, -9; it is the 8 that I captured on the first of January 2000, etc.

If I look at this `pandas.Series`, I can see that it consists of the raw mathematical data that is being stored, and some index that helps me figure out how I translate my human understanding of where this data is captured to where the data is actually stored in memory:

In [159]:
print(
    s.array,
    s.index,
    sep="\n\n"
)

<NumpyExtensionArray>
[8, 4, -4, -10, -9]
Length: 5, dtype: int64

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05'],
              dtype='datetime64[ns]', name='date', freq='D')


I can do things like: give me the data on the 1st of January, or give me the data until the end of the year, or up to whenever it was last captured:

In [160]:
print(
    s["2000-01-01"],
    s["2000-01-01":],
    sep="\n\n"
)

8

date
2000-01-01     8
2000-01-02     4
2000-01-03    -4
2000-01-04   -10
2000-01-05    -9
Freq: D, dtype: int64


This works, and is a lot more convenient than doing it as raw matrix operations, especially if I am working with multiple datasets that weren't captured over the same timespan, but there is an overlapping timespan that I want to consider.
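
As a minimal sketch of that overlapping-timespan case, with two made-up series whose date ranges are offset by one day: arithmetic pairs the values up by label, and the dates that exist on only one side come out as missing.

```python
from pandas import Series, date_range

first = Series([1, 2, 3], index=date_range("2000-01-01", periods=3))
second = Series([10, 20, 30], index=date_range("2000-01-02", periods=3))

# Addition pairs values by date label, not by position.
total = first + second
print(total)

# Only the overlapping dates have a value on both sides.
print(total.dropna())
```

With raw arrays the same addition would silently pair the 1st of January with the 2nd.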

It turns out that the Pandas index is a way to label data, both in order to access it and in order to give additional meaning to the manipulations that we perform on it.

Here we have a `pandas.Series`, indexed with values that appear to be some sort of measurement, from -20 to 20 with some stepping:

In [161]:
idx = linspace(-20, 20, 5)
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [162]:
print(
    s,
    s.index,
    s[0],   # the value with label 0 (not the first value, different)
    s[0:3], # the value with labels from 0 up to 3, just the same value 0
    sep="\n\n"
)

-20.0    5
-10.0   -8
 0.0     0
 10.0    7
 20.0    8
dtype: int64

Index([-20.0, -10.0, 0.0, 10.0, 20.0], dtype='float64')

0

0.0    0
dtype: int64



If I ask for the values from 0 to 3 in the underlying `NumpyExtensionArray`, which doesn't even have an index, I get the physical values from physical location 0 up to physical location 3, versus the values from label 0 to label 3:

In [163]:
print(f"{s.array[0:3] = }\n\n{len(s[0:3]) = }")

s.array[0:3] = <NumpyExtensionArray>
[5, -8, 0]
Length: 3, dtype: int64

len(s[0:3]) = 1



One of them gives me 3, the other gives me 1.

#### Why `.loc`

It turns out that one unfortunate thing about `pandas.Series` is that the square bracket notation can be a little bit ambiguous. If I change the index slightly so that it is an integer index, then suddenly the square-bracket slice seems to give me the values by physical position:

In [164]:
idx = linspace(-20, 20, 5, dtype=int)
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [165]:
print(
    s,
    s.index,
    s[0],   # the value with label 0 (not the first value, different)
    s[0:3], # now this seems to give me the values by physical position
    f"{s.array[0:3] = }\n\n{len(s[0:3]) = }",   
    sep="\n\n"
)

-20   -5
-10   -9
 0    -1
 10    6
 20    2
dtype: int64

Index([-20, -10, 0, 10, 20], dtype='int64')

-1

-20   -5
-10   -9
 0    -1
dtype: int64

s.array[0:3] = <NumpyExtensionArray>
[-5, -9, -1]
Length: 3, dtype: int64

len(s[0:3]) = 3


If I make these floating-point values I get something similar to what I had before:

In [166]:
idx = array([-1, 0, 1, 2, 3, 4], dtype=float)
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [167]:
print(
    s,
    s.index,
    s[0],   # the value with label 0 (not the first value, different)
    s[0:3], # seems to give me something similar to what I had before
    f"{s.array[0:3] = }\n\n{len(s[0:3]) = }",   
    sep="\n\n"
)

-1.0   -9
 0.0    3
 1.0   -4
 2.0   -6
 3.0   -2
 4.0    7
dtype: int64

Index([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype='float64')

3

0.0    3
1.0   -4
2.0   -6
3.0   -2
dtype: int64

s.array[0:3] = <NumpyExtensionArray>
[-9, 3, -4]
Length: 3, dtype: int64

len(s[0:3]) = 4



It's looking up the labels again, but you can see this mismatch occurs.

#### `.loc`

Pandas addressed this, and we have an unambiguous, specific way to access the contents of a `pandas.Series`:

In [168]:
idx = linspace(-20, 20, 5, dtype=int)
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [169]:
print(
    s,
    s.index,
    s.loc[0],   # the value with label 0 (not the first value, different)
    s.loc[0:3], # USING THE INDEX LABELS AS INTENDED, SAME VAL as s.loc[0]
    f"{s.array[0:3] = }\n\n{len(s[0:3]) = }",   
    sep="\n\n"
)

-20    9
-10   -8
 0     1
 10    5
 20   -5
dtype: int64

Index([-20, -10, 0, 10, 20], dtype='int64')

1

0    1
dtype: int64

s.array[0:3] = <NumpyExtensionArray>
[9, -8, 1]
Length: 3, dtype: int64

len(s[0:3]) = 3


#### `.iloc`

Use it if you don't want to use the index.

The `.loc` is going to be affected by the index, the `.iloc` won't, because it is not going to use the index:

In [170]:
print(
    s,
    s.index,
    s.iloc[0],   # the first value
    s.iloc[0:3], # the first 3 values
    f"{s.array[0:3] = }\n\n{len(s[0:3]) = }",   
    sep="\n\n"
)

-20    9
-10   -8
 0     1
 10    5
 20   -5
dtype: int64

Index([-20, -10, 0, 10, 20], dtype='int64')

9

-20    9
-10   -8
 0     1
dtype: int64

s.array[0:3] = <NumpyExtensionArray>
[9, -8, 1]
Length: 3, dtype: int64

len(s[0:3]) = 3


If we were to compare `.iloc` to the array lookup index we would get something very similar:

In [171]:
print(
    s,
    s.iloc[0:3],
    s.array[0:3],
    sep="\n\n"
)

-20    9
-10   -8
 0     1
 10    5
 20   -5
dtype: int64

-20    9
-10   -8
 0     1
dtype: int64

<NumpyExtensionArray>
[9, -8, 1]
Length: 3, dtype: int64


One of them gives me the index too, but in terms of the values, they both give me the same thing.

There are a couple of questions that might show up here:
1. Why does that array bracketing behave the way it does?
    - in a moment you will see why that actually makes sense
2. This is using interval notation when you are doing the slicing
    - almost anywhere else where I see an interval notation in Python it doesn't include the endpoint, but this seems to include the endpoint.
    - why on earth does this include the endpoint?
    - using `.loc[0:10]` gives me 2 rows if the indexing happens to match up for that, it gives me everything up to and including 10. That is a little bit bizarre:

In [172]:
print(
    s,
    s.loc[0:9],    # got one value, as expected
    s.loc[0:10],   # got two values, BUT EXPECTED ONLY ONE
    s.loc[0:10+1], # got two values, as expected
    sep="\n\n"
)

-20    9
-10   -8
 0     1
 10    5
 20   -5
dtype: int64

0    1
dtype: int64

0     1
10    5
dtype: int64

0     1
10    5
dtype: int64


Before we take a look at that, before we think about that, let's change our index very slightly, to an index of the letters "a" to "e":

In [173]:
idx = [*"abcde"]
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

We might now be able to reason about this:

In [174]:
print(
    s,
    s.loc["b"],
    s.loc["b":"d"],
    sep="\n\n"
)

a   -5
b   -6
c   -6
d    7
e   -6
dtype: int64

-6

b   -6
c   -6
d    7
dtype: int64


The `.loc` operation uses the index to figure out what value you want, but the index doesn't have to be numeric values, it could just be labels "a", "b", "c", etc., and if we want to include up to the endpoint, we can't just do `.loc["d"+1]` on an arbitrary index, this is just not well defined.

We can see that `.loc` including the endpoint makes total sense: it uses the index, which is not necessarily going to have some sort of successor operation associated with it. As a consequence `.loc` has to include the endpoint, otherwise you would not necessarily have a way to ask for it. What does `+1` mean on a DatetimeIndex: one day, one hour, one minute?
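
The same inclusive behavior on a datetime index can be sketched like this, with made-up values on a daily index; no "+1 day" arithmetic is needed to include the last date.

```python
from pandas import Series, date_range

idx = date_range("2000-01-01", periods=5, name="date")
s = Series([10, 20, 30, 40, 50], index=idx)

# Both endpoints are labels, and both are included in the result.
window = s.loc["2000-01-02":"2000-01-04"]
print(window)
```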

We can see some interesting ways in which these indexes operate.

Here if we ask for "a" or from "a" to "c", depending on that index, it may give us many different things. If our index is like this, where there is a bunch of nonsense in between, it will give me "a" to the end or from "a" to "c" with all the nonsense:

In [176]:
idx = [*".a...c."]
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [177]:
print(
    s,
    s.loc["a":],
    s.loc["a":"c"],
    sep="\n\n"
)

.   -6
a   -8
.   -8
.    5
.   -5
c    6
.    1
dtype: int64

a   -8
.   -8
.    5
.   -5
c    6
.    1
dtype: int64

a   -8
.   -8
.    5
.   -5
c    6
dtype: int64


If the "a" and the "c" are repeated you can see it's pretty smart, it gives me from the first "a" to the end, or from "a" to "c", from the very first "a" to the very last "c":

In [178]:
idx = [*".aa.cc."]
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [179]:
print(
    s,
    s.loc["a":],
    s.loc["a":"c"],
    sep="\n\n"
)

.    7
a    1
a    5
.    6
c   -9
c    1
.   -1
dtype: int64

a    1
a    5
.    6
c   -9
c    1
.   -1
dtype: int64

a    1
a    5
.    6
c   -9
c    1
dtype: int64


But if they are interleaved, I get a weird error:

In [180]:
idx = [*".ac.ac."]
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [181]:
try:
    print(
        s,
        s.loc["a":],
        s.loc["a":"c"],
        sep="\n\n"
    )
except Exception as ex:
    # expected an exception
    print(ex)

"Cannot get left slice bound for non-unique label: 'a'"


But that actually kind of makes sense. This is not a bizarre, crazy error; it is saying that you are asking for the values from "a" to "c", but there is an "ac" and another "ac". Which "a" did you want, and which "c"? The very first "a" and the last "c"? That's not quite clear. If the index is in sorted order then there is an unambiguous choice, but if not there may be ambiguity.

Many of the errors you see in Pandas are actually warning you about ambiguities which do not have well-defined semantics; it turns out that these ambiguities can turn up in different circumstances depending on the index.
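
A sketch of the "sorted order removes the ambiguity" point, using the same interleaved labels but made-up values (`range(7)`) so the rows are easy to follow. `sort_index` is used here only as one way to get a sorted index.

```python
from pandas import Series

s = Series(range(7), index=[*".ac.ac."])

try:
    s.loc["a":"c"]
except KeyError as ex:
    print(f"unsorted: {ex}")

# After sorting, all equal labels sit next to each other,
# so "from a to c" names exactly one contiguous block.
ordered = s.sort_index()
print(ordered.index.is_monotonic_increasing)
print(ordered.loc["a":"c"])
```

Sorting changes the physical order of the rows, so it is a choice about what the slice should mean, not a free fix.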

There are, in fact, many types of indexes that you may encounter.

There are your simple **range indices**:

In [182]:
from pandas import date_range, interval_range, period_range
from numpy import arange

In [183]:
idx = range(0,5)
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [184]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

0    4
1    9
2    2
3   -3
4   -9
dtype: int64

RangeIndex(start=0, stop=5, step=1)

<NumpyExtensionArray>
[0, 1, 2, 3, 4]
Length: 5, dtype: int64


If you look at that range index and you look under the covers, it exposes an array of the values from 0 up to and including 4.
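
As a small aside, and as far as I understand it, a `RangeIndex` is described by just three numbers, with the array of labels derived from them. A sketch with made-up values:

```python
from pandas import RangeIndex, Series

s = Series([4, 9, 2, -3, -9])  # no index given, so pandas supplies one

idx = s.index
print(isinstance(idx, RangeIndex))
print(idx.start, idx.stop, idx.step)  # the three numbers that define it
print(list(idx))                      # the labels they stand for
```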

In [185]:
idx = range(0,10,2)
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [186]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

0     1
2    -6
4     1
6   -10
8     6
dtype: int64

RangeIndex(start=0, stop=10, step=2)

<NumpyExtensionArray>
[0, 2, 4, 6, 8]
Length: 5, dtype: int64


In this index there is stepping.

There is an `int64` index, and you can see that it's a bunch of integers that represent that a given integer maps to a given physical location:

In [187]:
idx = arange(0, 5)
rng.shuffle(idx)
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [188]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

4     8
2     8
3   -10
1     1
0     6
dtype: int64

Index([4, 2, 3, 1, 0], dtype='int64')

<NumpyExtensionArray>
[4, 2, 3, 1, 0]
Length: 5, dtype: int64


There's a **datetime index**.

You can have a datetime index with a set periodicity:

In [189]:
idx = date_range("2000-01-01", periods=5, name="date")
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [190]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

date
2000-01-01    5
2000-01-02   -2
2000-01-03    7
2000-01-04    6
2000-01-05   -8
Freq: D, dtype: int64

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05'],
              dtype='datetime64[ns]', name='date', freq='D')

<DatetimeArray>
['2000-01-01 00:00:00', '2000-01-02 00:00:00', '2000-01-03 00:00:00',
 '2000-01-04 00:00:00', '2000-01-05 00:00:00']
Length: 5, dtype: datetime64[ns]


You can change what that frequency is:

In [191]:
idx = date_range("2000-01-01", periods=5, freq="2T", name="date")
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [192]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

date
2000-01-01 00:00:00   -10
2000-01-01 00:02:00    -8
2000-01-01 00:04:00    -3
2000-01-01 00:06:00    -8
2000-01-01 00:08:00    -9
Freq: 2T, dtype: int64

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:02:00',
               '2000-01-01 00:04:00', '2000-01-01 00:06:00',
               '2000-01-01 00:08:00'],
              dtype='datetime64[ns]', name='date', freq='2T')

<DatetimeArray>
['2000-01-01 00:00:00', '2000-01-01 00:02:00', '2000-01-01 00:04:00',
 '2000-01-01 00:06:00', '2000-01-01 00:08:00']
Length: 5, dtype: datetime64[ns]
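
A short sketch of how the frequency shows up in lookups. Note that pandas 2.2 deprecated the `T` alias in favor of `min`, so this example assumes the `"2min"` spelling; `step` and `pos` are names made up for the illustration.

```python
from pandas import Timedelta, Timestamp, date_range

# Same index as above, spelled with the "min" alias instead of "T".
idx = date_range("2000-01-01", periods=5, freq="2min", name="date")

step = idx[1] - idx[0]                               # spacing between labels
pos = idx.get_loc(Timestamp("2000-01-01 00:04:00"))  # label -> physical position

print(f"{step = }", f"{pos = }", sep="\n")
```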


You can have an **interval index**, for when you are looking for a value that falls between 0 and 1, 1 and 2, 2 and 3:

In [193]:
idx = interval_range(0, 5, name="value")
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [194]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

value
(0, 1]   -5
(1, 2]    3
(2, 3]    0
(3, 4]   -5
(4, 5]    9
dtype: int64

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]', name='value')

<IntervalArray>
[(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]]
Length: 5, dtype: interval[int64, right]


Or between 0 and 2, 2 and 4, ..., with a notation for which side is closed:

In [195]:
idx = interval_range(0, 10, freq=2, closed="left", name="value")
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [196]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

value
[0, 2)     4
[2, 4)     2
[4, 6)     8
[6, 8)     5
[8, 10)   -8
dtype: int64

IntervalIndex([[0, 2), [2, 4), [4, 6), [6, 8), [8, 10)], dtype='interval[int64, left]', name='value')

<IntervalArray>
[[0, 2), [2, 4), [4, 6), [6, 8), [8, 10)]
Length: 5, dtype: interval[int64, left]
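
Here is a small sketch of what the closed side means for lookups, assuming non-overlapping intervals as in the two examples above. The series values are simply the positions `0..4` so the result of a lookup is easy to read; `right_closed` and `left_closed` are names made up for this illustration.

```python
from pandas import Series, interval_range

# Values are the physical positions themselves, to make lookups easy to read.
right_closed = Series(range(5), index=interval_range(0, 5))
left_closed = Series(range(5), index=interval_range(0, 10, freq=2, closed="left"))

inside = right_closed.index.get_loc(1.5)   # 1.5 falls in (1, 2]
on_edge = right_closed.index.get_loc(1)    # 1 belongs to (0, 1], not (1, 2]
left_edge = left_closed.index.get_loc(2)   # 2 belongs to [2, 4), not [0, 2)
by_label = left_closed.loc[2]              # .loc goes through the same lookup

print(f"{inside = }", f"{on_edge = }", f"{left_edge = }", f"{by_label = }", sep="\n")
```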


There is a **period index**, for when you are looking for a value in a given quarter of a year:

In [197]:
idx = period_range("2000Q1", periods=3, freq="Q", name="quarter")
s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)

In [198]:
print(
    s,
    s.index,
    s.index.array,
    sep="\n\n"
)

quarter
2000Q1   -10
2000Q2     7
2000Q3    -2
Freq: Q-DEC, dtype: int64

PeriodIndex(['2000Q1', '2000Q2', '2000Q3'], dtype='period[Q-DEC]', name='quarter')

<PeriodArray>
['2000Q1', '2000Q2', '2000Q3']
Length: 3, dtype: period[Q-DEC]
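
A minimal sketch of a lookup by quarter, with the values from the output above hard-coded. It assumes we build the key as a `Period`; any date inside the quarter produces the same key. The names `key`, `pos` and `val` are made up for this illustration.

```python
from pandas import Period, Series, period_range

idx = period_range("2000Q1", periods=3, freq="Q", name="quarter")
s = Series([-10, 7, -2], index=idx)

# A date in the middle of May falls in the second quarter.
key = Period("2000-05-15", freq="Q")

pos = idx.get_loc(key)   # physical position of that quarter
val = s.loc[key]         # value stored for that quarter

print(f"{key = }", f"{pos = }", f"{val = }", sep="\n")
```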


There are all these possibilities, and this may lead you to believe, especially if you look at `.array`, that the index is data. But **the Pandas index isn't actually data**. It is something that is convertible to data, and it is often backed by data, but **the Pandas index is actually a mechanism**.

Here is my proof for it.

If I have a `RangeIndex` with all the values from 0 up to 100, this is what it looks like:

In [199]:
from pandas import RangeIndex

In [200]:
idx = RangeIndex(0, 100)

In [201]:
print(
    idx,
)

RangeIndex(start=0, stop=100, step=1)


If I have a `RangeIndex` with all the values from 0 up to 100_000_000_000_000:

In [202]:
idx = RangeIndex(100_000_000_000_000)

In [203]:
print(idx)

RangeIndex(start=0, stop=100000000000000, step=1)


If it were data, I would have run out of memory: there is no way I can store 100 trillion values. Yet I can create this `RangeIndex`.
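
One way to check this claim is to ask the index how much memory it reports using. This is only a sketch: the exact number of bytes depends on the Python and pandas versions, so it is printed rather than stated, and the variable names are made up for the illustration.

```python
from pandas import RangeIndex

idx = RangeIndex(100_000_000_000_000)

size = len(idx)                 # the index claims 100 trillion labels...
footprint = idx.memory_usage()  # ...but only reports a handful of bytes
pos = idx.get_loc(24)           # and lookups still work, without materializing anything

print(f"{size = :_}", f"{footprint = }", f"{pos = }", sep="\n")
```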

An index is a mechanism by which we take a human description of the data (the label, the key) and map it to a physical location. A `RangeIndex` is a very simple case: if you ask for the element at 24, it just gives 24 back to you, and if the index starts from a different point or has a different stepping, it does a little bit of arithmetic to figure out the corresponding physical location.
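
That "little bit of arithmetic" can be sketched in plain Python. The helper `range_get_loc` below is hypothetical, written only for this illustration; it is not how pandas implements `RangeIndex.get_loc`, it just produces the same positions.

```python
from pandas import RangeIndex

def range_get_loc(start, stop, step, key):
    """Hypothetical helper: map a label to its position in range(start, stop, step)."""
    if key not in range(start, stop, step):
        raise KeyError(key)
    # Distance from the start, measured in steps.
    return (key - start) // step

idx = RangeIndex(10, 20, 5)

# The sketch agrees with pandas for every label in the index.
for key in idx:
    assert range_get_loc(10, 20, 5, key) == idx.get_loc(key)

print(f"{range_get_loc(10, 20, 5, 15) = }")
```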

When you look at the index, the index has some information about itself.

It knows things, like:

In [204]:
idx = RangeIndex(10, 20, 5)

In [205]:
print(
    idx,
    f"{idx.has_duplicates          = }",  # whether it contains duplicates
    #f"{idx.is_monotonic = }",  # whether it's monotonic (removed in
        # pandas 2.0; use is_monotonic_increasing instead)
    f"{idx.is_monotonic_decreasing = }", # in sorted order (decreasing)
    f"{idx.is_monotonic_increasing = }", # in sorted order (increasing)
    f"{idx.is_unique               = }", # whether the values are unique,
        # which determines whether operations that use the index, such
        # as `.loc`, are well defined or not
    f"{idx.get_loc                 = }", # the index is centered around
        # this method: it translates one key into a physical position
    f"{idx.get_loc(10)             = }", # the value 10 in a range index
        # of 10 to 20 stepped by 5 is at physical position 0.
        # This is the translation between the human description of what
        # the data is and the physical location of where the data is in
        # some underlying storage
    f"{idx.slice_locs(10, 13)      = }", # all the labels from 10 up to
        # 13 correspond to the slice 0 to 1 over the physical locations
    f"{idx.get_indexer(range(10,30,5)) = }", # for several keys at once
        # (here 10 to 30 stepping by 5), it gives the physical position
        # of each key: position 0, position 1, and then -1 for each key
        # that is not located in the original dataset, which hints at
        # why those -1 positions need special handling
    
    sep="\n"
)

RangeIndex(start=10, stop=20, step=5)
idx.has_duplicates          = False
idx.is_monotonic_decreasing = False
idx.is_monotonic_increasing = True
idx.is_unique               = True
idx.get_loc                 = <bound method RangeIndex.get_loc of RangeIndex(start=10, stop=20, step=5)>
idx.get_loc(10)             = 0
idx.slice_locs(10, 13)      = (0, 1)
idx.get_indexer(range(10,30,5)) = array([ 0,  1, -1, -1])
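
To see why those `-1` positions matter, here is a minimal sketch that reindexes a small series onto the same keys. The values `100` and `200` and the names `wanted`, `positions` and `reindexed` are made up for this illustration.

```python
from pandas import RangeIndex, Series

s = Series([100, 200], index=RangeIndex(10, 20, 5))
wanted = range(10, 30, 5)   # labels 10, 15, 20, 25

# Physical position of each wanted label, or -1 when the label is absent.
positions = s.index.get_indexer(wanted)

# Reindexing has to put something in the rows whose position was -1.
reindexed = s.reindex(wanted)

print(positions, reindexed, sep="\n\n")
```

The rows for 20 and 25 have no source data, so they come back as missing values, and the integer column is converted to floats to make room for them.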


The indices, in fact, support a bunch of common operations and a common API.

Here are a bunch of different indices:

In [206]:
#from pandas import Int64Index    # removed in pandas 2.0
#from pandas import Float64Index  # removed in pandas 2.0
from pandas import Index, IntervalIndex, DatetimeIndex, PeriodIndex
from pandas import to_datetime, to_timedelta

In [207]:
indices = [
    (range(5), RangeIndex),
    (range(5, 10), RangeIndex),
    (range(5, 15, 2), RangeIndex),
    #([0, 1, 2, 3, 5], Int64Index),       # removed in pandas 2.0
    #([0, 1, 2, 3, 5.0], Float64Index),   # removed in pandas 2.0
    ([*"abcde"], Index),
    (interval_range(0, 10, periods=5, closed="left"), IntervalIndex),
    (date_range("2000-01-01", periods=5), DatetimeIndex),
    (to_datetime("2000-01-01") 
         + to_timedelta([0, 0, 1, 1, 2], unit="D"), DatetimeIndex),
    (period_range("2000-01-01", periods=5, freq="Q"), PeriodIndex),    
]

If we iterate through each of these indices and take a look at `.get_loc` and at whether the index is monotonic, is unique, or has duplicates, we see that they all support these:

In [208]:
for idx, typ in indices:
    s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)
    assert isinstance(s.index, typ), f"{s.index = } is not {typ = } but is {type(s.index) = }"

In [209]:
for idx, _ in indices:
    s = Series(rng.integers(-10, +10, size=len(idx)), index=idx)
    print(
        s.index,
        f"{s.index.get_loc(s.index[0]) = }",
        f"{s.index.is_monotonic_increasing = }",
        f"{s.index.is_unique = }",
        f"{s.index.has_duplicates = }",
        sep="\n"
    )
    print()

RangeIndex(start=0, stop=5, step=1)
s.index.get_loc(s.index[0]) = 0
s.index.is_monotonic_increasing = True
s.index.is_unique = True
s.index.has_duplicates = False

RangeIndex(start=5, stop=10, step=1)
s.index.get_loc(s.index[0]) = 0
s.index.is_monotonic_increasing = True
s.index.is_unique = True
s.index.has_duplicates = False

RangeIndex(start=5, stop=15, step=2)
s.index.get_loc(s.index[0]) = 0
s.index.is_monotonic_increasing = True
s.index.is_unique = True
s.index.has_duplicates = False

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s.index.get_loc(s.index[0]) = 0
s.index.is_monotonic_increasing = True
s.index.is_unique = True
s.index.has_duplicates = False

IntervalIndex([[0, 2), [2, 4), [4, 6), [6, 8), [8, 10)], dtype='interval[int64, left]')
s.index.get_loc(s.index[0]) = 0
s.index.is_monotonic_increasing = True
s.index.is_unique = True
s.index.has_duplicates = False

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05'],
            

In [210]:
# 45:20

If we take all these index types, look at all the methods they support (except for the ones that begin with underscores), and find the intersection, we see that the common index types you interact with implement a fairly large surface area of methods. It's more than just `.is_monotonic_increasing` and `.is_monotonic_decreasing`; there are quite a lot of them.
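
A minimal sketch of that intersection, assuming the five index classes imported earlier are the ones of interest. The names `index_types` and `common` are made up for this illustration, and the exact count depends on the pandas version, so it is printed rather than stated.

```python
from pandas import DatetimeIndex, Index, IntervalIndex, PeriodIndex, RangeIndex

index_types = [RangeIndex, Index, IntervalIndex, DatetimeIndex, PeriodIndex]

# Public attribute names (no leading underscore) shared by every index type.
common = set.intersection(*(
    {name for name in dir(typ) if not name.startswith("_")}
    for typ in index_types
))

print(len(common))
print(sorted(common))
```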

However, this makes it somewhat difficult for you to create your own mechanisms for looking up the data, that is, to extend the index, especially if you want to do something a little bit unusual.

Here is an attempt at a very simple symmetric index: