### My walk-through of [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html).

### [Object creation](https://pandas.pydata.org/docs/user_guide/10min.html#object-creation)

In [9]:
import numpy as np
import pandas as pd

In [10]:
s = pd.Series([1, 3, 4, np.nan, 6, 8])
s

0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [11]:
s.describe()

count    5.000000
mean     4.400000
std      2.701851
min      1.000000
25%      3.000000
50%      4.000000
75%      6.000000
max      8.000000
dtype: float64

In [12]:
x = pd.date_range('20210101', periods=6)
x

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [13]:
df = pd.DataFrame(np.random.randn(6, 4), index=x, columns=list('ABCD'))
df

| | A | B | C | D |
|---|---|---|---|---|
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 |
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 |


In [14]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.429404 | 0.32379 | 0.131712 | -0.331408 |
| std | 0.708012 | 1.060554 | 0.918089 | 0.964555 |
| min | -1.15589 | -0.740447 | -0.980854 | -1.199054 |
| 25% | -0.943128 | -0.572656 | -0.487743 | -1.082031 |
| 50% | -0.624621 | 0.118558 | -0.038495 | -0.62628 |
| 75% | 0.018945 | 1.222073 | 0.871436 | 0.203118 |
| max | 0.64454 | 1.65922 | 1.310864 | 1.217529 |


Creating a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) by passing a dict of objects that can be converted to series-like.

In [23]:
df2 = pd.DataFrame(
    {
        'A': 1.0,
        'B': pd.Timestamp('20210101'),
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
        'D': np.array([3]*4, dtype='int64'),
        'E': pd.Categorical(['test', 'train', 'test', 'train']),
        'F': 'foo',
    }
)

df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2021-01-01 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2021-01-01 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2021-01-01 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2021-01-01 | 1.0 | 3 | train | foo |


Interesting. A single value (such as the `1.0` in column A, the timestamp in column B, and so on) is simply repeated to fill every row.

In this example, the number of records seems to be determined by this line:

```
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
```

The Series has a four-element index, so this DataFrame ends up with four records.

Columns A, B, and F are each given as a single value. Each one is expanded into a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) of that length; for example, the scalar `1.0` becomes `[1.0, 1.0, 1.0, 1.0]`.


If you change the multiplier on column D to 5:
```
        'D': np.array([3]*5, dtype='int64'),
```

Pandas raises a `ValueError`, because the number of records is no longer consistent with the other columns.

Pandas also does not extend a list-like. If you pass this for column E:
```
        'E': pd.Categorical(['test', 'train']),
```
Pandas does not repeat this 2-entry list to fill four rows; it raises a `ValueError` instead.


This is much like the broadcasting done by Excel [Dynamic Arrays](https://techcommunity.microsoft.com/t5/excel-blog/preview-of-dynamic-arrays-in-excel/ba-p/252944).

[R](https://www.r-project.org/) takes a different approach in a situation like this. It tries to recycle (repeat) the shorter vector so that it fits the entire data frame.

Different languages take different approaches. Very interesting.
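
Here is a minimal sketch of the behaviour described above, assuming a reasonably recent pandas release. Scalars are broadcast to the length set by the Series, while a list-like of the wrong length raises a `ValueError` rather than being recycled. The names `ok`, `raised`, and `fixed` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Scalars ('A', 'F') are broadcast to the four rows defined by the Series in 'C'.
ok = pd.DataFrame(
    {
        'A': 1.0,
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
        'F': 'foo',
    }
)
print(ok.shape)  # (4, 3)

# A 2-entry list-like is not recycled to four rows; pandas raises instead.
raised = False
try:
    pd.DataFrame(
        {
            'C': pd.Series(1, index=list(range(4)), dtype='float64'),
            'E': pd.Categorical(['test', 'train']),
        }
    )
except ValueError as err:
    raised = True
    print('ValueError:', err)

# To get R-style recycling, repeat the values explicitly (here with np.tile).
fixed = pd.DataFrame(
    {
        'C': pd.Series(1, index=list(range(4)), dtype='float64'),
        'E': pd.Categorical(np.tile(['test', 'train'], 2)),
    }
)
print(fixed['E'].tolist())  # ['test', 'train', 'test', 'train']
```

If you want recycling, spelling it out with something like `np.tile` keeps the intent visible, which seems to be the trade-off pandas prefers.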

In [24]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float64
D             int64
E          category
F            object
dtype: object

### [Viewing data](https://pandas.pydata.org/docs/user_guide/10min.html#viewing-data)

In [27]:
df.index

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [28]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [29]:
# Transposing

df.T

| | 2021-01-01 | 2021-01-02 | 2021-01-03 | 2021-01-04 | 2021-01-05 | 2021-01-06 |
|---|---|---|---|---|---|---|
| A | -0.965594 | -1.15589 | 0.64454 | -0.373512 | 0.149764 | -0.87573 |
| B | 0.60675 | 1.427181 | -0.64033 | -0.740447 | -0.369635 | 1.65922 |
| C | 0.30955 | 1.058731 | 1.310864 | -0.38654 | -0.980854 | -0.521478 |
| D | -0.383379 | -0.869181 | 1.217529 | -1.152981 | 0.398618 | -1.199054 |


In [30]:
df.sort_values(by='B')

| | A | B | C | D |
|---|---|---|---|---|
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 |
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 |
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 |


### [Selection](https://pandas.pydata.org/docs/user_guide/10min.html#selection)

In [31]:
### Selecting one column
df['A']
### This returns a Series.

2021-01-01   -0.965594
2021-01-02   -1.155890
2021-01-03    0.644540
2021-01-04   -0.373512
2021-01-05    0.149764
2021-01-06   -0.875730
Freq: D, Name: A, dtype: float64

In [32]:
df.A

2021-01-01   -0.965594
2021-01-02   -1.155890
2021-01-03    0.644540
2021-01-04   -0.373512
2021-01-05    0.149764
2021-01-06   -0.875730
Freq: D, Name: A, dtype: float64

In [34]:
### Selecting multiple columns
df[['B', 'D']]

| | B | D |
|---|---|---|
| 2021-01-01 | 0.60675 | -0.383379 |
| 2021-01-02 | 1.427181 | -0.869181 |
| 2021-01-03 | -0.64033 | 1.217529 |
| 2021-01-04 | -0.740447 | -1.152981 |
| 2021-01-05 | -0.369635 | 0.398618 |
| 2021-01-06 | 1.65922 | -1.199054 |


In [35]:
### Slicing also works on rows. Interesting at best, confusing for beginners.
df[0:2]

| | A | B | C | D |
|---|---|---|---|---|
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 |


In [36]:
### Also works on index. Wow...
df['20210102':'20210104']

| | A | B | C | D |
|---|---|---|---|---|
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 |
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 |
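
One detail worth checking in the two row slices above is how the end point is treated. The sketch below uses a small deterministic frame (`demo` and `idx` are made-up names) to show that the integer slice excludes its end position, while the date-string slice includes its end label.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20210101', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list('ABCD'))

# Integer slice: positional, end position excluded (rows 0 and 1 only).
by_position = demo[0:2]

# Date-string slice: label-based, end label included (Jan 2, 3, and 4).
by_label = demo['20210102':'20210104']

print(len(by_position))    # 2
print(len(by_label))       # 3
print(by_label.index[-1])  # 2021-01-04 00:00:00
```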


### [Selection by label](https://pandas.pydata.org/docs/user_guide/10min.html#selection-by-label)

[pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) can be used in a variety of ways. Note that it is indexed with square brackets rather than called like a function.

In [37]:
# selecting a row
df.loc[x[1]]

A   -1.155890
B    1.427181
C    1.058731
D   -0.869181
Name: 2021-01-02 00:00:00, dtype: float64

In [38]:
df.loc[:, ('B', 'D')]

| | B | D |
|---|---|---|
| 2021-01-01 | 0.60675 | -0.383379 |
| 2021-01-02 | 1.427181 | -0.869181 |
| 2021-01-03 | -0.64033 | 1.217529 |
| 2021-01-04 | -0.740447 | -1.152981 |
| 2021-01-05 | -0.369635 | 0.398618 |
| 2021-01-06 | 1.65922 | -1.199054 |


In [40]:
df.loc[:, 'B':'D']

| | B | C | D |
|---|---|---|---|
| 2021-01-01 | 0.60675 | 0.30955 | -0.383379 |
| 2021-01-02 | 1.427181 | 1.058731 | -0.869181 |
| 2021-01-03 | -0.64033 | 1.310864 | 1.217529 |
| 2021-01-04 | -0.740447 | -0.38654 | -1.152981 |
| 2021-01-05 | -0.369635 | -0.980854 | 0.398618 |
| 2021-01-06 | 1.65922 | -0.521478 | -1.199054 |


In [42]:
df.loc['20210101':'20210103', ('A', 'C')]

| | A | C |
|---|---|---|
| 2021-01-01 | -0.965594 | 0.30955 |
| 2021-01-02 | -1.15589 | 1.058731 |
| 2021-01-03 | 0.64454 | 1.310864 |


Okay, so `DataFrame.loc` takes the row selection first and the column selection second, both inside square brackets.
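
To double-check the rows-first, columns-second reading, here is a small sketch on a deterministic frame (`demo` and `idx` are made-up names). It passes a list for the column selection, which is the more common spelling, and shows that label slices include both end points.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20210101', periods=6)
# Cell (i, j) holds 4*i + j, so results are easy to verify by eye.
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list('ABCD'))

row = demo.loc[idx[1]]                             # one row label -> Series
cols = demo.loc[:, ['B', 'D']]                     # all rows, two columns
block = demo.loc['20210101':'20210103', 'B':'D']   # both slices include their end label
cell = demo.loc[idx[1], 'B']                       # row label + column label -> scalar

print(row.tolist())   # [4, 5, 6, 7]
print(block.shape)    # (3, 3)
print(cell)           # 5
```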

### [Selection by position](https://pandas.pydata.org/docs/user_guide/10min.html#selection-by-position)

[pandas.DataFrame.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) lets you select by integer position.
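
A minimal sketch of position-based selection, again on a made-up deterministic frame named `demo`. Unlike label slices, integer slices with `iloc` follow the usual Python rule and exclude the end position.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20210101', periods=6)
# Cell (i, j) holds 4*i + j, so results are easy to verify by eye.
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list('ABCD'))

row = demo.iloc[3]                      # fourth row as a Series
block = demo.iloc[3:5, 0:2]             # rows 3-4, columns 0-1 (end positions excluded)
picked = demo.iloc[[1, 2, 4], [0, 2]]   # explicit lists of row and column positions
cell = demo.iloc[1, 1]                  # single value

print(row.tolist())            # [12, 13, 14, 15]
print(block.values.tolist())   # [[12, 13], [16, 17]]
print(picked.values.tolist())  # [[4, 6], [8, 10], [16, 18]]
print(cell)                    # 5
```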

### [Boolean indexing](https://pandas.pydata.org/docs/user_guide/10min.html#boolean-indexing)

In [44]:
df['A'] > 0

2021-01-01    False
2021-01-02    False
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Freq: D, Name: A, dtype: bool

In [45]:
df[df['A']>0]
# In this tutorial, this is called a where operation.

| | A | B | C | D |
|---|---|---|---|---|
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 |
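
The upstream guide shows two flavours of boolean indexing, and they behave differently. The sketch below (with a made-up three-row frame `demo`) contrasts a Series mask, which drops whole rows, with a DataFrame mask, which keeps the shape and fills the non-matching cells with `NaN`.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-1.0, 2.0, -3.0], 'B': [4.0, -5.0, 6.0]})

# Series mask: keeps only the rows where column A is positive.
rows = demo[demo['A'] > 0]

# DataFrame mask: keeps the shape, replaces cells that fail the test with NaN.
cells = demo[demo > 0]

print(rows.index.tolist())            # [1]
print(cells.shape)                    # (3, 2)
print(int(cells.isna().sum().sum()))  # 3
```

The second form gives the same result as `demo.where(demo > 0)`.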


In [53]:
df2 = df.copy()
df2

| | A | B | C | D |
|---|---|---|---|---|
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 |
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 |


In [54]:
df2['E'] = ('one', 'one', 'two', 'three', 'four', 'three')
df2

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 | one |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 | one |
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 | two |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 | three |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 | four |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 | three |


In [55]:
df2[df2['E'].isin(('two', 'three'))]

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 | two |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 | three |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 | three |


### [Setting](https://pandas.pydata.org/docs/user_guide/10min.html#setting)

In [58]:
s1 = pd.Series((20, 30, 40, 50, 60, 70), index=pd.date_range('20210102', periods=6))
s1

# Note: a Series or DataFrame always has an index attached to it.

2021-01-02    20
2021-01-03    30
2021-01-04    40
2021-01-05    50
2021-01-06    60
2021-01-07    70
Freq: D, dtype: int64

In [60]:
df['F'] = s1
df

| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 | NaN |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 | 20.0 |
| 2021-01-03 | 0.64454 | -0.64033 | 1.310864 | 1.217529 | 30.0 |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 | 40.0 |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 | 50.0 |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 | 60.0 |


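Okay, note that the entry with index `2021-01-07` in Series `s1` was dropped when it was added to DataFrame `df`. Conversely, the row for `2021-01-01`, which has no matching entry in `s1`, got a missing value. The assignment aligns on the index rather than on position.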
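
The alignment can be reproduced on a smaller, deterministic frame. In this sketch `demo` is a made-up stand-in for `df`; it also shows that the new column becomes `float64`, since the integers from `s1` have to share the column with `NaN`.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20210101', periods=6)            # Jan 1 .. Jan 6
demo = pd.DataFrame({'A': np.arange(6.0)}, index=idx)

s1 = pd.Series([20, 30, 40, 50, 60, 70],
               index=pd.date_range('20210102', periods=6))  # Jan 2 .. Jan 7

# Assignment aligns s1 to demo's index by label, not by position.
demo['F'] = s1

print(demo['F'].tolist())   # [nan, 20.0, 30.0, 40.0, 50.0, 60.0]
print(demo['F'].dtype)      # float64
print(pd.Timestamp('2021-01-07') in demo.index)  # False
```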

In [64]:
# Set values by label
df.at[x[2], 'C'] = 100
df

| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 | NaN |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 | 20.0 |
| 2021-01-03 | 0.64454 | -0.64033 | 100.0 | 1.217529 | 30.0 |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 | 40.0 |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 | 50.0 |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 | 60.0 |


In [66]:
# iat lets you select a single value by integer position
df.iat[2, 2]

100.0

In [68]:
df.iat[2, 2] = 150
df

| | A | B | C | D | F |
|---|---|---|---|---|---|
| 2021-01-01 | -0.965594 | 0.60675 | 0.30955 | -0.383379 | NaN |
| 2021-01-02 | -1.15589 | 1.427181 | 1.058731 | -0.869181 | 20.0 |
| 2021-01-03 | 0.64454 | -0.64033 | 150.0 | 1.217529 | 30.0 |
| 2021-01-04 | -0.373512 | -0.740447 | -0.38654 | -1.152981 | 40.0 |
| 2021-01-05 | 0.149764 | -0.369635 | -0.980854 | 0.398618 | 50.0 |
| 2021-01-06 | -0.87573 | 1.65922 | -0.521478 | -1.199054 | 60.0 |
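
As a compact recap of the two scalar accessors used above, this sketch sets one cell by label with `at` and another by position with `iat`, then reads each one back through the other accessor. The frame `demo` and its labels are made up for the illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.zeros((3, 2)), index=list('xyz'), columns=['A', 'B'])

demo.at['y', 'B'] = 100.0   # by row label and column label
demo.iat[2, 0] = 150.0      # by row position and column position

# The same cells, read back through the other accessor.
print(demo.iat[1, 1])       # 100.0
print(demo.at['z', 'A'])    # 150.0
```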
