# Intro to Data Structures
A quick intro to data structures from the [pandas docs](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro). Worth reading before diving into more advanced pandas material. First, let's import pandas and NumPy.

In [1]:
import numpy as np
import pandas as pd

## Series

In [5]:
??pd.Series

### From ndarray
If **data** is an *ndarray*, then **index** must be the same length as **data**. If no index is passed, one will be created with the values [0, ..., len(data)-1].

In [6]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [7]:
s

a    2.777195
b    0.176360
c   -0.222080
d    1.181364
e    0.659670
dtype: float64

In [8]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [9]:
pd.Series(np.random.randn(5))

0   -0.638350
1   -0.463546
2   -0.216170
3   -0.100839
4    1.748732
dtype: float64

### From dict
A Series can be created from a dictionary.

In [10]:
d = {'b': 1, 'a': 0, 'c': 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

The Series retains the order of the given dict (older versions of Python and pandas might sort by key instead).

When an index is passed, it will apply its ordering (and put NaN for keys with no corresponding value).

In [11]:
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

Interestingly, it now converted values to *float64*. Perhaps because of *NaN*? Let's see.

In [12]:
pd.Series(d, index=['b', 'c', 'a'])

b    1
c    2
a    0
dtype: int64

Yep: *NaN* is a floating-point value, so a single missing entry forces the whole Series to be represented as *float64*.
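The upcast can be checked directly by comparing dtypes. The sketch below also tries the nullable `Int64` extension dtype, which the text above does not cover; treat it as an assumption about what you might want when integers have to survive missing values, and check that your pandas version supports it.

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# Every requested label exists in the dict, so the integers are kept.
complete = pd.Series(d, index=['b', 'c', 'a'])

# 'd' has no value; NaN is a float, so the whole Series is upcast.
with_gap = pd.Series(d, index=['b', 'c', 'd', 'a'])

# Assumption: the nullable 'Int64' dtype (capital I) is available.
# It stores the gap as a missing value while keeping integer values.
nullable = with_gap.astype('Int64')

print(complete.dtype, with_gap.dtype, nullable.dtype)
```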

### From scalar value
If *data* is a scalar value, an index must be provided. The scalar value then will be repeated for each index element.

In [13]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'c'])

a    5.0
b    5.0
c    5.0
d    5.0
c    5.0
dtype: float64

### Series is ndarray-like
*Series* behaves very similarly to an *ndarray* and is a valid argument to most *NumPy* functions. Beware that slicing also slices the index.

In [14]:
s

a    2.777195
b    0.176360
c   -0.222080
d    1.181364
e    0.659670
dtype: float64

In [15]:
s[0]

2.777195011275134

In [16]:
s[:3]

a    2.777195
b    0.176360
c   -0.222080
dtype: float64

In [17]:
s[s > s.median()]

a    2.777195
d    1.181364
dtype: float64

In [18]:
s.median()

0.6596695272283994

In [19]:
s[[4, 3, 1]]

e    0.659670
d    1.181364
b    0.176360
dtype: float64

In [20]:
np.exp(s)

a    16.073871
b     1.192867
c     0.800851
d     3.258817
e     1.934153
dtype: float64
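The integer lookups above (`s[0]`, `s[[4, 3, 1]]`) rely on pandas treating integers as positions when the index holds strings. Recent pandas versions warn about that fallback, so a more explicit sketch uses `.iloc` for positions and `.loc` for labels. The values here are made up for illustration.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=list('abcde'))

first = s.iloc[0]            # scalar at position 0
picked = s.iloc[[4, 3, 1]]   # positions 4, 3, 1, in that order
head = s.iloc[:3]            # positional slice; the index is sliced too
by_label = s.loc['a':'c']    # label slice; both endpoints are included

print(picked.index.tolist(), head.index.tolist())
```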

### Series is dict-like
Basically, it is possible to get and set values by index label.

In [21]:
s['a']

2.777195011275134

In [22]:
s

a    2.777195
b    0.176360
c   -0.222080
d    1.181364
e    0.659670
dtype: float64

In [23]:
s['e'] = 12

In [24]:
s

a     2.777195
b     0.176360
c    -0.222080
d     1.181364
e    12.000000
dtype: float64

In [25]:
'e' in s

True

In [26]:
'f' in s

False

In [28]:
# s['f'] would raise a KeyError because the label does not exist

In [29]:
s['g'] = .3
s

a     2.777195
b     0.176360
c    -0.222080
d     1.181364
e    12.000000
g     0.300000
dtype: float64

In [30]:
s.get('f')

In [31]:
s.get('g')

0.3

In [32]:
s.get('f', np.nan)

nan
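The difference between bracket access and `get` for a missing label can be sketched as follows. The Series and its labels are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'b'])

# Bracket access on a missing label raises KeyError, like a dict.
try:
    s['f']
    raised = False
except KeyError:
    raised = True

# get() returns None, or the supplied default, instead of raising.
no_default = s.get('f')
with_default = s.get('f', np.nan)

print(raised, no_default, with_default)
```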

### Vectorized operations and label alignment with Series
As with *NumPy* arrays, it is usually not necessary to loop through Series values. A Series can also be passed into most *NumPy* functions that expect an ndarray.

In [33]:
s

a     2.777195
b     0.176360
c    -0.222080
d     1.181364
e    12.000000
g     0.300000
dtype: float64

In [34]:
s + s

a     5.554390
b     0.352719
c    -0.444161
d     2.362729
e    24.000000
g     0.600000
dtype: float64

In [35]:
s * 2

a     5.554390
b     0.352719
c    -0.444161
d     2.362729
e    24.000000
g     0.600000
dtype: float64

In [36]:
np.exp(s)

a        16.073871
b         1.192867
c         0.800851
d         3.258817
e    162754.791419
g         1.349859
dtype: float64

A key difference between *Series* and *ndarray* is that operations between Series automatically align the data based on labels.

In [38]:
s

a     2.777195
b     0.176360
c    -0.222080
d     1.181364
e    12.000000
g     0.300000
dtype: float64

In [39]:
s[1:]

b     0.176360
c    -0.222080
d     1.181364
e    12.000000
g     0.300000
dtype: float64

In [40]:
s[:-1]

a     2.777195
b     0.176360
c    -0.222080
d     1.181364
e    12.000000
dtype: float64

In [41]:
s[1:] + s[:-1]

a          NaN
b     0.352719
c    -0.444161
d     2.362729
e    24.000000
g          NaN
dtype: float64
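Labels present in only one operand produce *NaN* because the result covers the union of both indexes. A small sketch with made-up values shows this, along with `Series.add` and its `fill_value` argument as one way to treat the missing side as zero.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=list('abcd'))

left = s.iloc[1:]    # labels b, c, d
right = s.iloc[:-1]  # labels a, b, c

# Aligned on labels: 'a' and 'd' exist on one side only, so they become NaN.
aligned = left + right

# fill_value substitutes 0 for the side that lacks a label.
filled = left.add(right, fill_value=0)

print(aligned)
print(filled)
```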

### Name attribute

In [43]:
s.rename("new name")

a     2.777195
b     0.176360
c    -0.222080
d     1.181364
e    12.000000
g     0.300000
Name: new name, dtype: float64
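Note that `rename` returns a new Series and leaves the original untouched. A minimal sketch, with illustrative names:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='old name')

renamed = s.rename('new name')  # new object carrying the new name

print(s.name, '|', renamed.name)
```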

## DataFrame
2-dimensional labeled data structure with columns of potentially different types. May resemble a spreadsheet or SQL table. Generally the most commonly used pandas object. The following can be optionally passed in as arguments:
- **index** (row labels)
- **columns** (column labels)

### From dict of Series or dicts
Nested dicts are first converted into Series. The resulting row index is the union of the indexes of the individual Series.

In [22]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [23]:
df = pd.DataFrame(d)

In [24]:
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [47]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4.0
b,2.0,2.0
a,1.0,1.0


In [48]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4.0,
b,2.0,
a,1.0,


In [49]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [50]:
df.columns

Index(['one', 'two'], dtype='object')

### From dict of ndarrays / lists
All *ndarrays* must be the same length. If an index is passed, it must also be the same length as the arrays. As with Series, when no index is defined, one is created automatically as range(length_of_the_array).

In [51]:
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}

In [52]:
pd.DataFrame(d)

Unnamed: 0,one,two
0,1.0,4.0
1,2.0,3.0
2,3.0,2.0
3,4.0,1.0


In [53]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Unnamed: 0,one,two
a,1.0,4.0
b,2.0,3.0
c,3.0,2.0
d,4.0,1.0
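The length requirements can be probed with a small helper. `try_build` is a hypothetical name introduced here for illustration; it returns either the DataFrame or the error message.

```python
import pandas as pd

def try_build(data, index=None):
    """Return the DataFrame, or the error text if construction fails."""
    try:
        return pd.DataFrame(data, index=index)
    except ValueError as exc:
        return f"ValueError: {exc}"

# Lists of equal length and a matching index: works.
ok = try_build({'one': [1., 2.], 'two': [3., 4.]}, index=['a', 'b'])

# Lists of different lengths: rejected.
ragged = try_build({'one': [1., 2., 3.], 'two': [1., 2.]})

# Index longer than the lists: rejected as well.
bad_index = try_build({'one': [1., 2.], 'two': [3., 4.]}, index=['a', 'b', 'c'])

print(type(ok).__name__, ragged, bad_index, sep='\n')
```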


### From structured or record array

In [10]:
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10' = 10-byte string (the older alias 'a10' is not accepted by newer NumPy)

In [11]:
data[:] = [(1,2.,'Hello'), (2,3.,"World")]

In [12]:
data

array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [13]:
pd.DataFrame(data)

Unnamed: 0,A,B,C
0,1,2.0,b'Hello'
1,2,3.0,b'World'


In [14]:
pd.DataFrame(data, index=['first', 'second'])

Unnamed: 0,A,B,C
first,1,2.0,b'Hello'
second,2,3.0,b'World'


In [15]:
pd.DataFrame(data, columns=['C', 'A', 'B'])

Unnamed: 0,C,A,B
0,b'Hello',1,2.0
1,b'World',2,3.0
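Column `C` holds `bytes` objects because the structured array uses a fixed-width byte-string field. If text is wanted instead, one option is to decode the column with `Series.str.decode`. This is a sketch that assumes the bytes are valid UTF-8.

```python
import numpy as np
import pandas as pd

# Same layout as above, written with the 'S10' byte-string code.
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data[:] = [(1, 2., b'Hello'), (2, 3., b'World')]

df_rec = pd.DataFrame(data)

# Decode bytes to str; assumes UTF-8 content.
df_rec['C'] = df_rec['C'].str.decode('utf-8')

print(df_rec)
```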


### From a list of dicts

In [16]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [17]:
pd.DataFrame(data2)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [18]:
pd.DataFrame(data2, index=['first', 'second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [19]:
pd.DataFrame(data2, columns=['a', 'b'])

Unnamed: 0,a,b
0,1,2
1,5,10


### From a dict of tuples

In [20]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
              ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
              ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,b,a,c,a,b
A,B,1.0,4.0,5.0,8.0,10.0
A,C,2.0,3.0,6.0,7.0,
A,D,,,,,9.0
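Tuple keys produce a `MultiIndex` on both axes: the outer dict keys become two-level column labels and the inner dict keys become two-level row labels. The reduced sketch below (three of the five columns) shows how such a frame might be inspected and selected from.

```python
import pandas as pd

df_multi = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
})

# Selecting on the first column level returns the sub-frame under 'a'.
sub = df_multi['a']

# A full (row tuple, column tuple) pair addresses a single cell.
cell = df_multi.loc[('A', 'B'), ('a', 'b')]

print(df_multi.columns.nlevels, df_multi.index.nlevels, cell)
```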


### Column selection, addition and deletion
A DataFrame can be treated like a dict of Series that share an index: columns are read, added, and deleted with the same syntax as dict entries.

In [25]:
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [26]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [27]:
df['three'] = df['one'] * df['two']

In [28]:
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,4.0
c,3.0,3.0,9.0
d,,4.0,


In [29]:
df['flag'] = df['one'] > 2

In [30]:
df

Unnamed: 0,one,two,three,flag
a,1.0,1.0,1.0,False
b,2.0,2.0,4.0,False
c,3.0,3.0,9.0,True
d,,4.0,,False


In [31]:
del df['two']

In [32]:
three = df.pop('three')

In [33]:
df

Unnamed: 0,one,flag
a,1.0,False
b,2.0,False
c,3.0,True
d,,False


In [34]:
three

a    1.0
b    4.0
c    9.0
d    NaN
Name: three, dtype: float64

When a scalar value is inserted, it is broadcast to fill every row of the new column.

In [35]:
df['foo'] = 'bar'

In [36]:
df

Unnamed: 0,one,flag,foo
a,1.0,False,bar
b,2.0,False,bar
c,3.0,True,bar
d,,False,bar


When a Series that does not cover every row label is inserted, it is aligned to the DataFrame's index and the missing rows become *NaN*.

In [37]:
df['one_trunc'] = df['one'][:2]

In [38]:
df

Unnamed: 0,one,flag,foo,one_trunc
a,1.0,False,bar,1.0
b,2.0,False,bar,2.0
c,3.0,True,bar,
d,,False,bar,


In [39]:
df.insert(1, 'bar', df['one'])

In [40]:
df

Unnamed: 0,one,bar,flag,foo,one_trunc
a,1.0,1.0,False,bar,1.0
b,2.0,2.0,False,bar,2.0
c,3.0,3.0,True,bar,
d,,,False,bar,
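The column operations from this section can be gathered into one compact sketch. The frame and the column name `copy_of_one` are made up for illustration.

```python
import pandas as pd

df_cols = pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 5., 6.]},
                       index=list('abc'))

df_cols['three'] = df_cols['one'] * df_cols['two']   # add, like dict assignment
df_cols.insert(1, 'copy_of_one', df_cols['one'])     # add at a chosen position
popped = df_cols.pop('three')                        # remove and return the column
del df_cols['two']                                   # remove without returning

print(df_cols.columns.tolist(), popped.tolist())
```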


### Indexing and selection
| **Operation**                  | **Syntax**    | **Result** |
| :---                           | :---          | :---       |
| Select column                  | df[col]       | Series     |
| Select row by label            | df.loc[label] | Series     |
| Select row by integer location | df.iloc[loc]  | Series     |
| Slice rows                     | df[5:10]      | DataFrame  |
| Select rows by boolean vector  | df[bool_vec]  | DataFrame  |

In [41]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [42]:
df.loc['b']

one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

In [43]:
df.iloc[1]

one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object
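The remaining rows of the table (row slicing and boolean selection) follow the same pattern. Here is a sketch covering all five operations on a small made-up frame:

```python
import pandas as pd

df_sel = pd.DataFrame({'one': [1., 2., 3., 4.],
                       'flag': [False, False, True, True]},
                      index=list('abcd'))

col = df_sel['one']              # column      -> Series
row_label = df_sel.loc['b']      # row by label -> Series
row_pos = df_sel.iloc[1]         # row by position -> Series
sliced = df_sel[1:3]             # row slice   -> DataFrame
masked = df_sel[df_sel['flag']]  # boolean vector -> DataFrame

print(sliced.index.tolist(), masked.index.tolist())
```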

### Data alignment and arithmetic

In [2]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

In [3]:
df

Unnamed: 0,A,B,C,D
0,1.35814,-0.238029,0.253981,-0.702935
1,-0.94497,-0.315386,0.445119,0.418373
2,-0.079441,-0.326678,2.009893,0.695473
3,-0.984577,-1.600205,-1.562719,0.875349
4,-0.006379,0.318474,0.833109,1.066716
5,-0.631727,1.243224,-0.987038,1.98429
6,0.120345,-0.079509,-0.576918,1.933788
7,1.94095,-0.934306,0.290365,0.068711
8,-0.284526,-0.996492,0.540494,0.156474
9,1.360262,-1.479145,-1.338868,-1.969697


In [4]:
df2

Unnamed: 0,A,B,C
0,-0.139781,0.421521,-0.232604
1,-1.827624,0.708037,1.335907
2,-0.109626,0.892878,1.78193
3,0.640909,0.975686,-0.167822
4,0.900496,-1.599447,-0.640741
5,0.308413,0.528466,0.57717
6,0.516765,-0.194418,-1.027327


In [5]:
df + df2

Unnamed: 0,A,B,C,D
0,1.218359,0.183492,0.021378,
1,-2.772593,0.392651,1.781026,
2,-0.189067,0.5662,3.791823,
3,-0.343668,-0.624519,-1.730541,
4,0.894117,-1.280973,0.192368,
5,-0.323314,1.771691,-0.409868,
6,0.63711,-0.273927,-1.604245,
7,,,,
8,,,,
9,,,,


In [6]:
df

Unnamed: 0,A,B,C,D
0,1.35814,-0.238029,0.253981,-0.702935
1,-0.94497,-0.315386,0.445119,0.418373
2,-0.079441,-0.326678,2.009893,0.695473
3,-0.984577,-1.600205,-1.562719,0.875349
4,-0.006379,0.318474,0.833109,1.066716
5,-0.631727,1.243224,-0.987038,1.98429
6,0.120345,-0.079509,-0.576918,1.933788
7,1.94095,-0.934306,0.290365,0.068711
8,-0.284526,-0.996492,0.540494,0.156474
9,1.360262,-1.479145,-1.338868,-1.969697


In [7]:
df - df.iloc[0]

Unnamed: 0,A,B,C,D
0,0.0,0.0,0.0,0.0
1,-2.30311,-0.077357,0.191138,1.121308
2,-1.437581,-0.088649,1.755911,1.398408
3,-2.342718,-1.362176,-1.816701,1.578284
4,-1.364519,0.556503,0.579127,1.769651
5,-1.989867,1.481253,-1.241019,2.687225
6,-1.237795,0.15852,-0.830899,2.636723
7,0.58281,-0.696277,0.036384,0.771646
8,-1.642667,-0.758463,0.286512,0.859409
9,0.002121,-1.241116,-1.59285,-1.266762


Subtracting a single row (a Series) from a DataFrame aligns the Series index with the DataFrame's columns and broadcasts row-wise, so that row is subtracted from every row of the DataFrame.
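A small deterministic sketch makes the broadcasting direction easier to see. It also shows `DataFrame.sub` with `axis=0`, which is one way to broadcast a column down the rows instead. The values are illustrative.

```python
import pandas as pd

df_b = pd.DataFrame({'A': [1., 2., 3.], 'B': [10., 20., 30.]})

# Row-wise: the first row is matched to columns A and B,
# then subtracted from every row.
row_wise = df_b - df_b.iloc[0]

# Column-wise: column A is matched on the row index
# and subtracted from every column.
col_wise = df_b.sub(df_b['A'], axis=0)

print(row_wise)
print(col_wise)
```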

More scalar operations.

In [9]:
df * 5 + 2

Unnamed: 0,A,B,C,D
0,8.790702,0.809855,3.269907,-1.514676
1,-2.724848,0.423071,4.225594,4.091863
2,1.602795,0.36661,12.049463,5.477366
3,-2.922887,-6.001024,-5.813596,6.376745
4,1.968106,3.592369,6.165544,7.333579
5,-1.158633,8.216121,-2.935188,11.921448
6,2.601725,1.602454,-0.884588,11.668938
7,11.704752,-2.671532,3.451825,2.343554
8,0.577369,-2.982458,4.702468,2.78237
9,8.801308,-5.395725,-4.694341,-7.848486


In [10]:
1 / df

Unnamed: 0,A,B,C,D
0,0.736301,-4.20117,3.937297,-1.422606
1,-1.058235,-3.170719,2.246591,2.390214
2,-12.587954,-3.061119,0.497539,1.43787
3,-1.015664,-0.62492,-0.63991,1.142402
4,-156.770239,3.139977,1.200323,0.937457
5,-1.582963,0.80436,-1.013133,0.503959
6,8.309445,-12.577176,-1.73335,0.51712
7,0.515212,-1.070313,3.443942,14.553736
8,-3.514614,-1.003521,1.850161,6.390838
9,0.735153,-0.676066,-0.7469,-0.507692


In [11]:
df ** 2

Unnamed: 0,A,B,C,D
0,1.844545,0.056658,0.064507,0.494118
1,0.892968,0.099468,0.198131,0.175036
2,0.006311,0.106718,4.039668,0.483683
3,0.969393,2.560655,2.442092,0.766236
4,4.1e-05,0.101426,0.69407,1.137883
5,0.399078,1.545606,0.974243,3.937405
6,0.014483,0.006322,0.332834,3.739534
7,3.767288,0.872928,0.084312,0.004721
8,0.080955,0.992995,0.292133,0.024484
9,1.850312,2.18787,1.792568,3.879707


Boolean operators.

In [14]:
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [15]:
df1

Unnamed: 0,a,b
0,True,False
1,False,True
2,True,True


In [16]:
df2

Unnamed: 0,a,b
0,False,True
1,True,True
2,True,False


In [17]:
df1 & df2

Unnamed: 0,a,b
0,False,False
1,False,True
2,True,False


In [19]:
df1 | df2

Unnamed: 0,a,b
0,True,True
1,True,True
2,True,True


In [20]:
df1 ^ df2

Unnamed: 0,a,b
0,True,True
1,True,False
2,False,True


In [23]:
~df1

Unnamed: 0,a,b
0,False,True
1,True,False
2,False,False
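For logical negation, `~` is the operator that works reliably on boolean data. Unary minus on booleans is rejected by current NumPy, and the sketch below assumes pandas inherits that behaviour, so it only exercises `~`.

```python
import pandas as pd

df_bool = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

inverted = ~df_bool  # element-wise logical NOT

print(inverted)
```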


### Transposing
Use **T** attribute (property).

In [24]:
df[:5].T

Unnamed: 0,0,1,2,3,4
A,1.35814,-0.94497,-0.079441,-0.984577,-0.006379
B,-0.238029,-0.315386,-0.326678,-1.600205,0.318474
C,0.253981,0.445119,2.009893,-1.562719,0.833109
D,-0.702935,0.418373,0.695473,0.875349,1.066716


### Interoperability with NumPy functions

In [25]:
np.exp(df)

Unnamed: 0,A,B,C,D
0,3.888955,0.78818,1.289148,0.49513
1,0.388691,0.729507,1.560676,1.519487
2,0.923632,0.721316,7.462515,2.004657
3,0.373597,0.201855,0.209565,2.399713
4,0.993642,1.375027,2.300459,2.905821
5,0.531673,3.466773,0.372679,7.273878
6,1.127886,0.92357,0.561627,6.915654
7,6.965368,0.392858,1.336915,1.071126
8,0.752371,0.369172,1.716854,1.16938
9,3.897213,0.227832,0.262142,0.139499


In [26]:
np.asarray(df)

array([[ 1.35814036, -0.23802893,  0.25398132, -0.7029353 ],
       [-0.94496962, -0.31538585,  0.44511886,  0.41837256],
       [-0.07944103, -0.32667795,  2.00989254,  0.69547311],
       [-0.98457738, -1.60020482, -1.56271927,  0.87534898],
       [-0.00637876,  0.31847371,  0.83310877,  1.06671587],
       [-0.63172655,  1.24322418, -0.98703763,  1.98428954],
       [ 0.12034499, -0.0795091 , -0.57691765,  1.93378759],
       [ 1.94095037, -0.93430641,  0.29036495,  0.06871088],
       [-0.28452625, -0.99649155,  0.5404936 ,  0.156474  ],
       [ 1.36026162, -1.47914498, -1.33886821, -1.9696972 ]])

DataFrame implements matrix multiplication as the *dot* method.

In [27]:
df.T.dot(df)

Unnamed: 0,A,B,C,D
A,9.825374,-2.762686,0.440651,-5.885203
B,-2.762686,8.530647,1.897822,3.753666
C,0.440651,1.897822,10.914558,0.59377
D,-5.885203,3.753666,0.59377,14.642807


Similarly, the dot method is available for Series.

In [29]:
s1 = pd.Series(np.arange(5, 10))

In [30]:
s1 

0    5
1    6
2    7
3    8
4    9
dtype: int32

In [31]:
s1.dot(s1)

255
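To check what `dot` computes, the result can be compared with plain NumPy matrix multiplication on a tiny frame where the arithmetic is easy to do by hand. The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame([[1., 2.], [3., 4.]], columns=['A', 'B'])

# Labels are aligned: columns of the left operand with the index of the right.
gram = df_m.T.dot(df_m)

# The same product on the raw arrays.
expected = df_m.values.T @ df_m.values

# For two Series, dot is the inner product: 25 + 36 + 49 + 64 + 81.
s1 = pd.Series(np.arange(5, 10))
inner = s1.dot(s1)

print(gram)
print(inner)
```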

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A    10 non-null float64
B    10 non-null float64
C    10 non-null float64
D    10 non-null float64
dtypes: float64(4)
memory usage: 400.0 bytes


In [33]:
df.to_string()

'          A         B         C         D\n0  1.358140 -0.238029  0.253981 -0.702935\n1 -0.944970 -0.315386  0.445119  0.418373\n2 -0.079441 -0.326678  2.009893  0.695473\n3 -0.984577 -1.600205 -1.562719  0.875349\n4 -0.006379  0.318474  0.833109  1.066716\n5 -0.631727  1.243224 -0.987038  1.984290\n6  0.120345 -0.079509 -0.576918  1.933788\n7  1.940950 -0.934306  0.290365  0.068711\n8 -0.284526 -0.996492  0.540494  0.156474\n9  1.360262 -1.479145 -1.338868 -1.969697'

In [34]:
df.head()

Unnamed: 0,A,B,C,D
0,1.35814,-0.238029,0.253981,-0.702935
1,-0.94497,-0.315386,0.445119,0.418373
2,-0.079441,-0.326678,2.009893,0.695473
3,-0.984577,-1.600205,-1.562719,0.875349
4,-0.006379,0.318474,0.833109,1.066716


In [35]:
df

Unnamed: 0,A,B,C,D
0,1.35814,-0.238029,0.253981,-0.702935
1,-0.94497,-0.315386,0.445119,0.418373
2,-0.079441,-0.326678,2.009893,0.695473
3,-0.984577,-1.600205,-1.562719,0.875349
4,-0.006379,0.318474,0.833109,1.066716
5,-0.631727,1.243224,-0.987038,1.98429
6,0.120345,-0.079509,-0.576918,1.933788
7,1.94095,-0.934306,0.290365,0.068711
8,-0.284526,-0.996492,0.540494,0.156474
9,1.360262,-1.479145,-1.338868,-1.969697


In [36]:
df.A

0    1.358140
1   -0.944970
2   -0.079441
3   -0.984577
4   -0.006379
5   -0.631727
6    0.120345
7    1.940950
8   -0.284526
9    1.360262
Name: A, dtype: float64
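Attribute access such as `df.A` is a convenience that only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. Bracket access always works. The column names below are chosen to show both limits.

```python
import pandas as pd

df_attr = pd.DataFrame({'A': [1, 2], 'count': [3, 4], 'my col': [5, 6]})

same = df_attr.A.equals(df_attr['A'])    # identifier-like name: both forms agree

method_wins = callable(df_attr.count)    # 'count' resolves to the method, not the column
count_col = df_attr['count']             # brackets still reach the column
spaced = df_attr['my col']               # names with spaces need brackets

print(same, method_wins, count_col.tolist(), spaced.tolist())
```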

## Finish
That's it!