### My walk-through of [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html).

### [Object creation](https://pandas.pydata.org/docs/user_guide/10min.html#object-creation)

In [1]:
import numpy as np
import pandas as pd

In [2]:
s = pd.Series([1, 3, 4, np.nan, 6, 8])
s

0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [3]:
s.describe()

count    5.000000
mean     4.400000
std      2.701851
min      1.000000
25%      3.000000
50%      4.000000
75%      6.000000
max      8.000000
dtype: float64

In [4]:
x = pd.date_range('20210101', periods=6)
x

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [5]:
df = pd.DataFrame(np.random.randn(6, 4), index=x, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459
2021-01-02,1.326525,-0.214701,0.022079,0.887874
2021-01-03,0.891641,1.239267,0.220368,0.540827
2021-01-04,-0.229054,0.233816,-1.870031,0.603779
2021-01-05,-1.188709,0.204488,-0.322705,-0.724128
2021-01-06,0.346631,2.239994,2.285553,-0.229533


In [6]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.132441,0.629855,0.001582,0.11056
std,1.249417,0.929671,1.340154,0.65141
min,-1.941678,-0.214701,-1.870031,-0.724128
25%,-0.948795,0.108324,-0.325004,-0.368977
50%,0.058789,0.219152,-0.150313,0.155647
75%,0.755389,0.987904,0.170796,0.588041
max,1.326525,2.239994,2.285553,0.887874


Creating a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) by passing a dict of objects that can be converted to series-like.

In [7]:
df2 = pd.DataFrame(
    {
        'A': 1.0,
        'B': pd.Timestamp('20210101'),
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
        'D': np.array([3]*4, dtype='int64'),
        'E': pd.Categorical(['test', 'train', 'test', 'train']),
        'F': 'foo',
    }
)

df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2021-01-01,1.0,3,test,foo
1,1.0,2021-01-01,1.0,3,train,foo
2,1.0,2021-01-01,1.0,3,test,foo
3,1.0,2021-01-01,1.0,3,train,foo


Interesting. A single entry (such as `1.0` in column A, the timestamp in column B, and so on) is simply extended to every row.

In this example, the index seems to come from this line:

```
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
```

The Series has four index entries, so the DataFrame has four records. The arrays in columns D and E have to match that length.

Columns A, B, and F are each given a single entry, which pandas expands into a full [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series). The scalar `1.0` becomes `[1.0, 1.0, 1.0, 1.0]`.
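
To convince myself of the scalar extension, here is a minimal sketch. The names `demo` and `with_index` are mine, not from the tutorial, and the frame is a cut-down version of `df2` above.

```python
import pandas as pd

# One column carries an index (a Series); the others are plain scalars.
demo = pd.DataFrame(
    {
        "A": 1.0,                                            # scalar
        "C": pd.Series(1, index=range(4), dtype="float64"),  # supplies the index
        "F": "foo",                                          # scalar
    }
)
print(demo.shape)          # (4, 3)
print(demo["A"].tolist())  # [1.0, 1.0, 1.0, 1.0]

# With scalars only there is no length to take, so pandas wants an explicit index.
scalar_only_error = None
try:
    pd.DataFrame({"A": 1.0, "F": "foo"})
except ValueError as err:
    scalar_only_error = str(err)
print(scalar_only_error)

with_index = pd.DataFrame({"A": 1.0, "F": "foo"}, index=range(3))
print(with_index.shape)    # (3, 2)
```

So the scalars are stretched to whatever index the DataFrame ends up with, whether that index comes from another column or from the `index=` argument.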


If you use 5 as the multiplier in column D:
```
        'D': np.array([3]*5, dtype='int64'),
```

pandas raises a `ValueError` because the array length is no longer consistent with the other columns.

Also, pandas does not extend a list. If you pass this for column E:
```
        'E': pd.Categorical(['test', 'train']),
```
pandas does not repeat this 2-entry list to make it four entries long; it raises a `ValueError` as well.
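
Both failures can be reproduced with a small sketch. The helper `try_build` is a hypothetical name of my own that just catches the error, and the exact error text may differ between pandas versions, so I only print it.

```python
import numpy as np
import pandas as pd


def try_build(columns):
    """Return the DataFrame, or the error message if pandas rejects the input."""
    try:
        return pd.DataFrame(columns)
    except ValueError as err:
        return f"ValueError: {err}"


# Column C is the four-entry Series from the example above.
base = {"C": pd.Series(1, index=range(4), dtype="float64")}

too_long = try_build({**base, "D": np.array([3] * 5, dtype="int64")})
too_short = try_build({**base, "E": pd.Categorical(["test", "train"])})
print(too_long)
print(too_short)

# Repeating the short list yourself is one way to make the lengths agree.
fixed = try_build({**base, "E": pd.Categorical(["test", "train"] * 2)})
print(fixed)
```

In other words, if you want the two-entry list repeated, you have to spell that out yourself.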


The scalar extension is much like broadcasting in Excel [Dynamic Arrays](https://techcommunity.microsoft.com/t5/excel-blog/preview-of-dynamic-arrays-in-excel/ba-p/252944).

[R](https://www.r-project.org/) takes a different approach in a situation like this. It recycles (repeats) a shorter vector so that it fits the whole data frame, as long as the longer length is a whole multiple of the shorter one.

Different languages take different approaches. Very interesting.

In [8]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float64
D             int64
E          category
F            object
dtype: object

### [Viewing data](https://pandas.pydata.org/docs/user_guide/10min.html#viewing-data)

In [9]:
df.index

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [11]:
# Transposing

df.T

Unnamed: 0,2021-01-01,2021-01-02,2021-01-03,2021-01-04,2021-01-05,2021-01-06
A,-1.941678,1.326525,0.891641,-0.229054,-1.188709,0.346631
B,0.07627,-0.214701,1.239267,0.233816,0.204488,2.239994
C,-0.32577,0.022079,0.220368,-1.870031,-0.322705,2.285553
D,-0.415459,0.887874,0.540827,0.603779,-0.724128,-0.229533


In [12]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2021-01-02,1.326525,-0.214701,0.022079,0.887874
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459
2021-01-05,-1.188709,0.204488,-0.322705,-0.724128
2021-01-04,-0.229054,0.233816,-1.870031,0.603779
2021-01-03,0.891641,1.239267,0.220368,0.540827
2021-01-06,0.346631,2.239994,2.285553,-0.229533


### [Selection](https://pandas.pydata.org/docs/user_guide/10min.html#selection)

In [13]:
### Selecting one column
df['A']
### This returns a Series.

2021-01-01   -1.941678
2021-01-02    1.326525
2021-01-03    0.891641
2021-01-04   -0.229054
2021-01-05   -1.188709
2021-01-06    0.346631
Freq: D, Name: A, dtype: float64

In [14]:
df.A

2021-01-01   -1.941678
2021-01-02    1.326525
2021-01-03    0.891641
2021-01-04   -0.229054
2021-01-05   -1.188709
2021-01-06    0.346631
Freq: D, Name: A, dtype: float64

In [15]:
### Selecting multiple columns
df[['B', 'D']]

Unnamed: 0,B,D
2021-01-01,0.07627,-0.415459
2021-01-02,-0.214701,0.887874
2021-01-03,1.239267,0.540827
2021-01-04,0.233816,0.603779
2021-01-05,0.204488,-0.724128
2021-01-06,2.239994,-0.229533


In [16]:
### A slice inside [] selects rows, not columns. Interesting, but confusing for beginners.
df[0:2]

Unnamed: 0,A,B,C,D
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459
2021-01-02,1.326525,-0.214701,0.022079,0.887874


In [17]:
### Slicing also works with index labels. Wow...
df['20210102':'20210104']

Unnamed: 0,A,B,C,D
2021-01-02,1.326525,-0.214701,0.022079,0.887874
2021-01-03,0.891641,1.239267,0.220368,0.540827
2021-01-04,-0.229054,0.233816,-1.870031,0.603779
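
To keep the three meanings of `[]` straight, here is a small sketch on a frame of my own (`frame`, filled with `np.arange` so the values are predictable). It assumes the same kind of daily `DatetimeIndex` as `df`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col = frame["A"]                              # a single label selects a column
rows_by_pos = frame[0:2]                      # an integer slice selects rows, end excluded
rows_by_label = frame["20210102":"20210104"]  # a label slice selects rows, end included

# The same row selections, spelled out explicitly.
assert rows_by_pos.equals(frame.iloc[0:2])
assert rows_by_label.equals(frame.loc["20210102":"20210104"])

print(len(rows_by_pos), len(rows_by_label))   # 2 3
```

The explicit `iloc` and `loc` spellings are longer, but they make it obvious whether positions or labels are meant.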


### [Selection by label](https://pandas.pydata.org/docs/user_guide/10min.html#selection-by-label)

[pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) supports several kinds of label-based selection. Note that it is indexed with square brackets, not called like a function.

In [18]:
# selecting a row
df.loc[x[1]]

A    1.326525
B   -0.214701
C    0.022079
D    0.887874
Name: 2021-01-02 00:00:00, dtype: float64

In [19]:
df.loc[:, ('B', 'D')]

Unnamed: 0,B,D
2021-01-01,0.07627,-0.415459
2021-01-02,-0.214701,0.887874
2021-01-03,1.239267,0.540827
2021-01-04,0.233816,0.603779
2021-01-05,0.204488,-0.724128
2021-01-06,2.239994,-0.229533


In [20]:
df.loc[:, 'B':'D']

Unnamed: 0,B,C,D
2021-01-01,0.07627,-0.32577,-0.415459
2021-01-02,-0.214701,0.022079,0.887874
2021-01-03,1.239267,0.220368,0.540827
2021-01-04,0.233816,-1.870031,0.603779
2021-01-05,0.204488,-0.322705,-0.724128
2021-01-06,2.239994,2.285553,-0.229533


In [21]:
df.loc['20210101':'20210103', ('A', 'C')]

Unnamed: 0,A,C
2021-01-01,-1.941678,-0.32577
2021-01-02,1.326525,0.022079
2021-01-03,0.891641,0.220368


Okay, so `DataFrame.loc[]` takes the row selector first and the column selector second.
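
A compact sketch of the row-then-column pattern, again on a hypothetical `frame` of my own with predictable values. I use a list for the column labels here; as far as I understand, tuples are normally meant for MultiIndex keys.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

one_row = frame.loc[dates[1]]                      # Series: one row, all columns
two_cols = frame.loc[:, ["B", "D"]]                # DataFrame: all rows, two columns
block = frame.loc["20210101":"20210103", "B":"D"]  # both slice ends are included
cell = frame.loc[dates[0], "A"]                    # a single value

print(block.shape)  # (3, 3)
print(cell)         # 0
```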

### [Selection by position](https://pandas.pydata.org/docs/user_guide/10min.html#selection-by-position)

[pandas.DataFrame.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) lets you select by integer position, again with square brackets.
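
I did not run any cells for this part, so here is a minimal sketch of what I would try. The `frame` below is my own stand-in with predictable values, not the `df` from above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

fourth_row = frame.iloc[3]              # Series: the row at position 3
block = frame.iloc[3:5, 0:2]            # rows 3-4, columns 0-1 (ends excluded)
picked = frame.iloc[[1, 2, 4], [0, 2]]  # explicit lists of positions
cell = frame.iloc[1, 1]                 # a single value

print(fourth_row.tolist())  # [12, 13, 14, 15]
print(block.shape)          # (2, 2)
print(cell)                 # 5
```

Unlike the label slices of `loc`, these slices follow the usual Python rule and leave out the end position.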

### [Boolean indexing](https://pandas.pydata.org/docs/user_guide/10min.html#boolean-indexing)

In [22]:
df['A'] > 0

2021-01-01    False
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05    False
2021-01-06     True
Freq: D, Name: A, dtype: bool

In [23]:
df[df['A']>0]
# In this tutorial, this is called a where operation.

Unnamed: 0,A,B,C,D
2021-01-02,1.326525,-0.214701,0.022079,0.887874
2021-01-03,0.891641,1.239267,0.220368,0.540827
2021-01-06,0.346631,2.239994,2.285553,-0.229533
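
For my own reference, a sketch of the two flavours of boolean selection, on a tiny hypothetical `frame`. A boolean Series keeps or drops whole rows, while a boolean DataFrame keeps the shape and blanks out the non-matching cells.

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [0.5, -0.5, 1.5]})

rows = frame[frame["A"] > 0]            # row filter: keeps rows 1 and 2
elementwise = frame[frame > 0]          # same shape, non-matching cells become NaN
same = frame.where(frame > 0)           # the explicit spelling of the line above
combined = frame[(frame["A"] > 0) & (frame["B"] > 0)]  # conditions combine with &

print(rows.index.tolist())              # [1, 2]
print(int(elementwise.isna().sum().sum()))  # 2
print(combined.index.tolist())          # [2]
```

The parentheses around each condition are needed because `&` binds more tightly than `>`.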


In [24]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459
2021-01-02,1.326525,-0.214701,0.022079,0.887874
2021-01-03,0.891641,1.239267,0.220368,0.540827
2021-01-04,-0.229054,0.233816,-1.870031,0.603779
2021-01-05,-1.188709,0.204488,-0.322705,-0.724128
2021-01-06,0.346631,2.239994,2.285553,-0.229533


In [25]:
df2['E'] = ('one', 'one', 'two', 'three', 'four', 'three')
df2

Unnamed: 0,A,B,C,D,E
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459,one
2021-01-02,1.326525,-0.214701,0.022079,0.887874,one
2021-01-03,0.891641,1.239267,0.220368,0.540827,two
2021-01-04,-0.229054,0.233816,-1.870031,0.603779,three
2021-01-05,-1.188709,0.204488,-0.322705,-0.724128,four
2021-01-06,0.346631,2.239994,2.285553,-0.229533,three


In [26]:
df2[df2['E'].isin(('two', 'three'))]

Unnamed: 0,A,B,C,D,E
2021-01-03,0.891641,1.239267,0.220368,0.540827,two
2021-01-04,-0.229054,0.233816,-1.870031,0.603779,three
2021-01-06,0.346631,2.239994,2.285553,-0.229533,three


### [Setting](https://pandas.pydata.org/docs/user_guide/10min.html#setting)

In [27]:
s1 = pd.Series((20, 30, 40, 50, 60, 70), index=pd.date_range('20210102', periods=6))
s1

# Note: a Series or DataFrame always has an index attached to it.

2021-01-02    20
2021-01-03    30
2021-01-04    40
2021-01-05    50
2021-01-06    60
2021-01-07    70
Freq: D, dtype: int64

In [28]:
df['F'] = s1
df

Unnamed: 0,A,B,C,D,F
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459,
2021-01-02,1.326525,-0.214701,0.022079,0.887874,20.0
2021-01-03,0.891641,1.239267,0.220368,0.540827,30.0
2021-01-04,-0.229054,0.233816,-1.870031,0.603779,40.0
2021-01-05,-1.188709,0.204488,-0.322705,-0.724128,50.0
2021-01-06,0.346631,2.239994,2.285553,-0.229533,60.0


Okay, note that the entry with index `2021-01-07` in Series `s1` was dropped when `s1` was added to DataFrame `df`, and `2021-01-01`, which `s1` does not have, became NaN. Assignment aligns on the index of the DataFrame on the left-hand side.
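
The alignment is easier to see on a smaller case. This is a sketch with names of my own (`frame`, `shifted`); the second assignment passes a plain NumPy array, which has no index and is therefore matched by position.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=pd.date_range("20210101", periods=3))
shifted = pd.Series([20, 30, 40], index=pd.date_range("20210102", periods=3))

# Series: aligned on the index. 2021-01-01 gets NaN, 2021-01-04 is dropped.
frame["F"] = shifted

# NumPy array: no index, so values go in by position (lengths must match).
frame["G"] = shifted.to_numpy()

print(frame["F"].tolist())  # [nan, 20.0, 30.0]
print(frame["G"].tolist())  # [20, 30, 40]
```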

In [29]:
# Set values by label
df.at[x[2], 'C'] = 100
df

Unnamed: 0,A,B,C,D,F
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459,
2021-01-02,1.326525,-0.214701,0.022079,0.887874,20.0
2021-01-03,0.891641,1.239267,100.0,0.540827,30.0
2021-01-04,-0.229054,0.233816,-1.870031,0.603779,40.0
2021-01-05,-1.188709,0.204488,-0.322705,-0.724128,50.0
2021-01-06,0.346631,2.239994,2.285553,-0.229533,60.0


In [30]:
# iat lets you get or set a single value by integer position
df.iat[2, 2]

100.0

In [31]:
df.iat[2, 2] = 150
df

Unnamed: 0,A,B,C,D,F
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459,
2021-01-02,1.326525,-0.214701,0.022079,0.887874,20.0
2021-01-03,0.891641,1.239267,150.0,0.540827,30.0
2021-01-04,-0.229054,0.233816,-1.870031,0.603779,40.0
2021-01-05,-1.188709,0.204488,-0.322705,-0.724128,50.0
2021-01-06,0.346631,2.239994,2.285553,-0.229533,60.0


### [Missing data](https://pandas.pydata.org/docs/user_guide/10min.html#missing-data)

In [32]:
df1 = df.reindex(index=x[0:4], columns=list(df.columns) + ['E'])
df1.loc[x[0] : x[1], 'E'] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459,,1.0
2021-01-02,1.326525,-0.214701,0.022079,0.887874,20.0,1.0
2021-01-03,0.891641,1.239267,150.0,0.540827,30.0,
2021-01-04,-0.229054,0.233816,-1.870031,0.603779,40.0,


In [33]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,F,E
2021-01-02,1.326525,-0.214701,0.022079,0.887874,20.0,1.0


In [34]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2021-01-01,-1.941678,0.07627,-0.32577,-0.415459,5.0,1.0
2021-01-02,1.326525,-0.214701,0.022079,0.887874,20.0,1.0
2021-01-03,0.891641,1.239267,150.0,0.540827,30.0,5.0
2021-01-04,-0.229054,0.233816,-1.870031,0.603779,40.0,5.0


### [Operations > Stats](https://pandas.pydata.org/docs/user_guide/10min.html#stats)

In [36]:
## Operations in general exclude missing data.
df.mean()

A    -0.132441
B     0.629855
C    24.964854
D     0.110560
F    40.000000
dtype: float64

In [37]:
# Operations run on axis 0 by default.
df.mean(axis=0)

A    -0.132441
B     0.629855
C    24.964854
D     0.110560
F    40.000000
dtype: float64

In [38]:
# And you can run on axis 1.
df.mean(axis=1)

2021-01-01    -0.651659
2021-01-02     4.404355
2021-01-03    36.534347
2021-01-04     7.747702
2021-01-05     9.593789
2021-01-06    12.928529
Freq: D, dtype: float64
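
To check the claim about missing data, here is a sketch on a tiny hypothetical `frame`. It assumes the `skipna` parameter of `mean`, which defaults to `True`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "F": [np.nan, 20.0, 40.0]})

col_means = frame.mean()               # NaN in F is skipped: (20 + 40) / 2
strict = frame.mean(skipna=False)      # NaN propagates instead
row_means = frame.mean(axis=1)         # one mean per row, NaN skipped as well

print(col_means.tolist())   # [2.0, 30.0]
print(strict.tolist())      # [2.0, nan]
print(row_means.tolist())   # [1.0, 11.0, 21.5]
```

This also explains why column F above averages to 40.0 even though its first row is empty.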