### My walk-through of [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html).

### [Object creation](https://pandas.pydata.org/docs/user_guide/10min.html#object-creation)

In [43]:
import numpy as np
import pandas as pd

In [44]:
s = pd.Series([1, 3, 4, np.nan, 6, 8])
s

0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [45]:
s.describe()

count    5.000000
mean     4.400000
std      2.701851
min      1.000000
25%      3.000000
50%      4.000000
75%      6.000000
max      8.000000
dtype: float64

In [46]:
x = pd.date_range('20210101', periods=6)
x

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [47]:
df = pd.DataFrame(np.random.randn(6, 4), index=x, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808
2021-01-03,0.931637,0.481341,-1.715976,-0.057648
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066
2021-01-05,1.295416,-0.100129,-0.859383,0.444198
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266


In [48]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.016365,-0.564476,-0.766088,-0.51424
std,1.109378,0.932726,0.553305,0.877998
min,-1.578108,-2.132935,-1.715976,-1.804773
25%,-0.756899,-0.935832,-0.857966,-1.080872
50%,0.084315,-0.361415,-0.79689,-0.365357
75%,0.810959,-0.0042,-0.362965,0.152583
max,1.295416,0.481341,-0.190125,0.444198


Creating a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) by passing a dict of objects that can be converted to series-like.

In [49]:
df2 = pd.DataFrame(
    {
        'A': 1.0,
        'B': pd.Timestamp('20210101'),
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
        'D': np.array([3]*4, dtype='int64'),
        'E': pd.Categorical(['test', 'train', 'test', 'train']),
        'F': 'foo',
    }
)

df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2021-01-01,1.0,3,test,foo
1,1.0,2021-01-01,1.0,3,train,foo
2,1.0,2021-01-01,1.0,3,test,foo
3,1.0,2021-01-01,1.0,3,train,foo


Interesting. A single value (such as the `1.0` in column A or the timestamp in column B) is simply repeated for every row.

In this example, the row index comes from the only Series in the dict:

```
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
```

Its index has four entries, so the DataFrame has four records. The array in column D and the Categorical in column E have to match that length.

Columns A, B, and F each hold a single value. Each one is expanded into a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) of the same length, so the scalar `1.0` becomes `[1.0, 1.0, 1.0, 1.0]`.


If you change the multiplier for column D to 5:
```
        'D': np.array([3]*5, dtype='int64'),
```

pandas raises a `ValueError` because the array length no longer matches the length of the index.

Also, pandas does not recycle a shorter list-like. If you give this for column E:
```
        'E': pd.Categorical(['test', 'train']),
```
pandas does not repeat this 2-entry list to fill four rows; it raises a `ValueError` instead.


This is much like the broadcasting in Excel [Dynamic Arrays](https://techcommunity.microsoft.com/t5/excel-blog/preview-of-dynamic-arrays-in-excel/ba-p/252944).

[R](https://www.r-project.org/) takes a different approach in this situation: it recycles a shorter vector to fit the length of the data frame, as long as the lengths divide evenly.

Different languages take different approaches. Very interesting.
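
The length rules described above can be checked with a small sketch. The names `ok` and `mismatch_error` are made up for this illustration, and the exact error message may differ between pandas versions, so only the exception type is relied on.

```python
import numpy as np
import pandas as pd

# Scalars are repeated to the length set by the list-like columns.
ok = pd.DataFrame(
    {
        "A": 1.0,  # scalar, repeated for every row
        "C": pd.Series(1, index=range(4), dtype="float64"),
        "D": np.array([3] * 4, dtype="int64"),
    }
)
print(ok.shape)  # (4, 3)

# A list-like of a different length is not recycled; pandas raises instead.
try:
    pd.DataFrame(
        {
            "C": pd.Series(1, index=range(4), dtype="float64"),
            "D": np.array([3] * 5, dtype="int64"),
        }
    )
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)  # ValueError
```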

In [50]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float64
D             int64
E          category
F            object
dtype: object

### [Viewing data](https://pandas.pydata.org/docs/user_guide/10min.html#viewing-data)

In [51]:
df.index

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [52]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [53]:
# Transposing

df.T

Unnamed: 0,2021-01-01,2021-01-02,2021-01-03,2021-01-04,2021-01-05,2021-01-06
A,-0.280296,0.448926,0.931637,-0.915766,1.295416,-1.578108
B,-0.6227,-2.132935,0.481341,0.027777,-0.100129,-1.040209
C,-0.740066,-0.237265,-1.715976,-0.853713,-0.859383,-0.190125
D,-1.804773,-1.216808,-0.057648,-0.673066,0.444198,0.22266


In [54]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773
2021-01-05,1.295416,-0.100129,-0.859383,0.444198
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066
2021-01-03,0.931637,0.481341,-1.715976,-0.057648


### [Selection](https://pandas.pydata.org/docs/user_guide/10min.html#selection)

In [55]:
# Selecting one column
df['A']
# This returns a Series.

2021-01-01   -0.280296
2021-01-02    0.448926
2021-01-03    0.931637
2021-01-04   -0.915766
2021-01-05    1.295416
2021-01-06   -1.578108
Freq: D, Name: A, dtype: float64

In [56]:
df.A

2021-01-01   -0.280296
2021-01-02    0.448926
2021-01-03    0.931637
2021-01-04   -0.915766
2021-01-05    1.295416
2021-01-06   -1.578108
Freq: D, Name: A, dtype: float64

In [57]:
# Selecting multiple columns
df[['B', 'D']]

Unnamed: 0,B,D
2021-01-01,-0.6227,-1.804773
2021-01-02,-2.132935,-1.216808
2021-01-03,0.481341,-0.057648
2021-01-04,0.027777,-0.673066
2021-01-05,-0.100129,0.444198
2021-01-06,-1.040209,0.22266


In [58]:
# Slicing with [] works on rows, not columns. Interesting at best, confusing for beginners.
df[0:2]

Unnamed: 0,A,B,C,D
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808


In [59]:
# Slicing also works on index labels, and both endpoints are included. Wow...
df['20210102':'20210104']

Unnamed: 0,A,B,C,D
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808
2021-01-03,0.931637,0.481341,-1.715976,-0.057648
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066


### [Selection by label](https://pandas.pydata.org/docs/user_guide/10min.html#selection-by-label)

[pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) selects by label. It is an indexer used with square brackets rather than a method, and it accepts a variety of selectors.

In [60]:
# selecting a row
df.loc[x[1]]

A    0.448926
B   -2.132935
C   -0.237265
D   -1.216808
Name: 2021-01-02 00:00:00, dtype: float64

In [61]:
df.loc[:, ('B', 'D')]

Unnamed: 0,B,D
2021-01-01,-0.6227,-1.804773
2021-01-02,-2.132935,-1.216808
2021-01-03,0.481341,-0.057648
2021-01-04,0.027777,-0.673066
2021-01-05,-0.100129,0.444198
2021-01-06,-1.040209,0.22266


In [62]:
df.loc[:, 'B':'D']

Unnamed: 0,B,C,D
2021-01-01,-0.6227,-0.740066,-1.804773
2021-01-02,-2.132935,-0.237265,-1.216808
2021-01-03,0.481341,-1.715976,-0.057648
2021-01-04,0.027777,-0.853713,-0.673066
2021-01-05,-0.100129,-0.859383,0.444198
2021-01-06,-1.040209,-0.190125,0.22266


In [63]:
df.loc['20210101':'20210103', ('A', 'C')]

Unnamed: 0,A,C
2021-01-01,-0.280296,-0.740066
2021-01-02,0.448926,-0.237265
2021-01-03,0.931637,-1.715976


Okay, so `DataFrame.loc[]` takes the row selector first and the column selector second.
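
As a recap of the row-then-column order, here is a minimal sketch on a hypothetical frame `demo` filled with consecutive integers, so the selected values are easy to predict. The names `demo`, `dates`, `row`, `block`, and `cell` are made up for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# A single row label returns that row as a Series indexed by column name.
row = demo.loc[dates[1]]

# Row slice first, column list second. Label slices include both endpoints.
block = demo.loc["20210101":"20210103", ["A", "C"]]

# One row label plus one column label returns a scalar.
cell = demo.loc[dates[0], "B"]

print(block.shape)  # (3, 2)
print(cell)         # 1
```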

### [Selection by position](https://pandas.pydata.org/docs/user_guide/10min.html#selection-by-position)

[pandas.DataFrame.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) lets you select by integer position.
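
A short sketch of position-based selection, using a hypothetical frame `demo` of consecutive integers rather than the random `df` above. Unlike label slices with `loc`, integer slices with `iloc` exclude the end position, as in plain Python.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# A single integer selects one row as a Series.
fourth_row = demo.iloc[3]

# Integer slices: rows 3 and 4, columns 0 and 1 (end positions excluded).
sub = demo.iloc[3:5, 0:2]

# Lists of positions pick arbitrary rows and columns.
picked = demo.iloc[[1, 2, 4], [0, 2]]

# Two integers return a single value.
value = demo.iloc[1, 1]

print(sub.shape)  # (2, 2)
print(value)      # 5
```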

### [Boolean indexing](https://pandas.pydata.org/docs/user_guide/10min.html#boolean-indexing)

In [64]:
df['A'] > 0

2021-01-01    False
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Freq: D, Name: A, dtype: bool

In [65]:
df[df['A']>0]
# In this tutorial, this is called a where operation.

Unnamed: 0,A,B,C,D
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808
2021-01-03,0.931637,0.481341,-1.715976,-0.057648
2021-01-05,1.295416,-0.100129,-0.859383,0.444198
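
Boolean selection behaves differently depending on the shape of the mask. The sketch below uses a small hypothetical frame `demo`: a boolean Series keeps or drops whole rows, while a boolean DataFrame keeps the original shape and puts NaN wherever the condition is False.

```python
import pandas as pd

demo = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [10.0, -20.0, 30.0]})

# Boolean Series as the key: rows where the condition is False are dropped.
filtered = demo[demo["A"] > 0]

# Boolean DataFrame as the key: shape is preserved, False cells become NaN.
masked = demo[demo > 0]

print(filtered.index.tolist())  # [1, 2]
print(masked.isna().sum().sum())  # 2
```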


In [66]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808
2021-01-03,0.931637,0.481341,-1.715976,-0.057648
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066
2021-01-05,1.295416,-0.100129,-0.859383,0.444198
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266


In [67]:
df2['E'] = ('one', 'one', 'two', 'three', 'four', 'three')
df2

Unnamed: 0,A,B,C,D,E
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773,one
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808,one
2021-01-03,0.931637,0.481341,-1.715976,-0.057648,two
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066,three
2021-01-05,1.295416,-0.100129,-0.859383,0.444198,four
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266,three


In [68]:
df2[df2['E'].isin(('two', 'three'))]

Unnamed: 0,A,B,C,D,E
2021-01-03,0.931637,0.481341,-1.715976,-0.057648,two
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066,three
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266,three


### [Setting](https://pandas.pydata.org/docs/user_guide/10min.html#setting)

In [69]:
s1 = pd.Series((20, 30, 40, 50, 60, 70), index=pd.date_range('20210102', periods=6))
s1

# Note: a Series or DataFrame always has an index attached to it.

2021-01-02    20
2021-01-03    30
2021-01-04    40
2021-01-05    50
2021-01-06    60
2021-01-07    70
Freq: D, dtype: int64

In [70]:
df['F'] = s1
df

Unnamed: 0,A,B,C,D,F
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773,
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808,20.0
2021-01-03,0.931637,0.481341,-1.715976,-0.057648,30.0
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066,40.0
2021-01-05,1.295416,-0.100129,-0.859383,0.444198,50.0
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266,60.0


Okay, note that the entry for `2021-01-07` in Series `s1` was dropped when it was added to DataFrame `df`, and `2021-01-01`, which has no match in `s1`, became NaN. The assignment aligns on the index of the DataFrame on the left-hand side.
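
The alignment can be reproduced on a smaller scale. This is a minimal sketch with hypothetical names `left` and `right`; it also shows that a plain list, which has no index, is assigned by position instead.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=pd.date_range("20210101", periods=3))
right = pd.Series([20, 30, 40], index=pd.date_range("20210102", periods=3))

# Assigning a Series aligns on left's index: 2021-01-01 has no match and
# becomes NaN, and 2021-01-04 from right is dropped.
left["F"] = right

# A plain list has no index, so it is assigned by position and must match in length.
left["G"] = [7, 8, 9]

print(left["F"].isna().tolist())  # [True, False, False]
print(len(left))                  # 3
```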

In [71]:
# Set values by label
df.at[x[2], 'C'] = 100
df

Unnamed: 0,A,B,C,D,F
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773,
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808,20.0
2021-01-03,0.931637,0.481341,100.0,-0.057648,30.0
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066,40.0
2021-01-05,1.295416,-0.100129,-0.859383,0.444198,50.0
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266,60.0


In [72]:
# iat lets you access a single value by integer position
df.iat[2, 2]

100.0

In [73]:
df.iat[2, 2] = 150
df

Unnamed: 0,A,B,C,D,F
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773,
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808,20.0
2021-01-03,0.931637,0.481341,150.0,-0.057648,30.0
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066,40.0
2021-01-05,1.295416,-0.100129,-0.859383,0.444198,50.0
2021-01-06,-1.578108,-1.040209,-0.190125,0.22266,60.0


### [Missing data](https://pandas.pydata.org/docs/user_guide/10min.html#missing-data)

In [74]:
df1 = df.reindex(index=x[0:4], columns=list(df.columns) + ['E'])
df1.loc[x[0] : x[1], 'E'] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773,,1.0
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808,20.0,1.0
2021-01-03,0.931637,0.481341,150.0,-0.057648,30.0,
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066,40.0,


In [75]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,F,E
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808,20.0,1.0


In [76]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773,5.0,1.0
2021-01-02,0.448926,-2.132935,-0.237265,-1.216808,20.0,1.0
2021-01-03,0.931637,0.481341,150.0,-0.057648,30.0,5.0
2021-01-04,-0.915766,0.027777,-0.853713,-0.673066,40.0,5.0


### [Operations > Stats](https://pandas.pydata.org/docs/user_guide/10min.html#stats)

In [77]:
## Operations in general exclude missing data.
df.mean()

A    -0.016365
B    -0.564476
C    24.519908
D    -0.514240
F    40.000000
dtype: float64

In [78]:
# Operations run on axis 0 by default.
df.mean(axis=0)

A    -0.016365
B    -0.564476
C    24.519908
D    -0.514240
F    40.000000
dtype: float64

In [79]:
# And you can run on axis 1.
df.mean(axis=1)

2021-01-01    -0.861959
2021-01-02     3.372384
2021-01-03    36.271066
2021-01-04     7.517046
2021-01-05    10.156020
2021-01-06    11.482843
Freq: D, dtype: float64
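
The two points above (missing data is skipped, and `axis` picks the direction) can be seen on a tiny hypothetical frame `demo`. The `skipna=False` call is added here only for contrast; it is not used in the walk-through.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

# Default: one mean per column, NaN ignored.
col_means = demo.mean()

# axis=1: one mean per row, NaN still ignored.
row_means = demo.mean(axis=1)

# skipna=False: any NaN in a column makes that column's result NaN.
strict = demo.mean(skipna=False)

print(col_means.tolist())  # [2.0, 5.0]
print(row_means.tolist())  # [2.5, 5.0, 4.5]
```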

In [80]:
df['A'].shift(10)

2021-01-01   NaN
2021-01-02   NaN
2021-01-03   NaN
2021-01-04   NaN
2021-01-05   NaN
2021-01-06   NaN
Freq: D, Name: A, dtype: float64

In [81]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=x)
s

2021-01-01    1.0
2021-01-02    3.0
2021-01-03    5.0
2021-01-04    NaN
2021-01-05    6.0
2021-01-06    8.0
Freq: D, dtype: float64

In [82]:
# shift shifts elements
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html
s = s.shift(1)
s

2021-01-01    NaN
2021-01-02    1.0
2021-01-03    3.0
2021-01-04    5.0
2021-01-05    NaN
2021-01-06    6.0
Freq: D, dtype: float64

In [83]:
# pandas automatically broadcasts. For example, if you subtract a Series from a DataFrame with axis=0,
# pandas aligns the Series with the row index and subtracts it from every column of the DataFrame.

# pandas.DataFrame.sub performs the subtraction
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sub.html
df.sub(s, axis=0)

# axis=0 aligns the Series with the row index. The default for DataFrame.sub is axis='columns'.

Unnamed: 0,A,B,C,D,F
2021-01-01,,,,,
2021-01-02,-0.551074,-3.132935,-1.237265,-2.216808,19.0
2021-01-03,-2.068363,-2.518659,147.0,-3.057648,27.0
2021-01-04,-5.915766,-4.972223,-5.853713,-5.673066,35.0
2021-01-05,,,,,
2021-01-06,-7.578108,-7.040209,-6.190125,-5.77734,54.0
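
To make the role of `axis` in `DataFrame.sub` concrete, here is a small sketch with hypothetical names `demo`, `offsets`, and `per_column`. With `axis=0` the Series is matched against the row index; with the default `axis='columns'` it is matched against the column names.

```python
import pandas as pd

idx = pd.date_range("20210101", periods=3)
demo = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]}, index=idx)

# One offset per row: aligned on the row index, subtracted from every column.
offsets = pd.Series([1.0, 2.0, 3.0], index=idx)
by_row = demo.sub(offsets, axis=0)

# One offset per column: aligned on the column names (the default axis).
per_column = pd.Series({"A": 10.0, "B": 1.0})
by_col = demo.sub(per_column)

print(by_row["A"].tolist())  # [9.0, 18.0, 27.0]
print(by_col["A"].tolist())  # [0.0, 10.0, 20.0]
```

Rows where the Series holds NaN come out as NaN in every column, which is what the output above shows for `2021-01-01` and `2021-01-05`.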


### [Apply](https://pandas.pydata.org/docs/user_guide/10min.html#apply)
Applies functions to data.

In [84]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2021-01-01,-0.280296,-0.6227,-0.740066,-1.804773,
2021-01-02,0.16863,-2.755635,-0.977331,-3.021581,20.0
2021-01-03,1.100267,-2.274293,149.022669,-3.079229,50.0
2021-01-04,0.1845,-2.246517,148.168955,-3.752295,90.0
2021-01-05,1.479916,-2.346646,147.309572,-3.308097,140.0
2021-01-06,-0.098192,-3.386855,147.119447,-3.085438,200.0


In [86]:
df.apply(lambda x: x.max() - x.min() )

A      2.873524
B      2.614276
C    150.859383
D      2.248971
F     40.000000
dtype: float64
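
By default `apply` passes each column to the function. A minimal sketch on a hypothetical frame `demo` shows the same range calculation per column and, with `axis=1`, per row.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 5, 3], "B": [10, 20, 40]})

# Default axis=0: the lambda receives one column at a time.
col_range = demo.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row at a time.
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.tolist())  # [4, 30]
print(row_range.tolist())  # [9, 15, 37]
```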