# Series

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

In [1]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [2]:
labels = ['a', 'b', 'c']
my_data = [10, 20, 30]
arr = np.array(my_data)
d = {'a': 10, 'b': 20, 'c':30}

In [3]:
pd.Series(data = my_data)

0    10
1    20
2    30
dtype: int64

In [5]:
pd.Series(data = my_data, index = labels)

a    10
b    20
c    30
dtype: int64

In [6]:
pd.Series(my_data, labels)

a    10
b    20
c    30
dtype: int64

In [7]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [9]:
pd.Series(arr, labels)

a    10
b    20
c    30
dtype: int32

In [10]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64

pandas automatically uses the dictionary's keys as the index and its values as the data points.
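
As a small sketch of the dictionary case (the `prices` dictionary and the label `'z'` are made up for illustration): when an explicit `index` is passed together with a dictionary, pandas looks the values up by key in the order given, and a label that is not a key in the dictionary becomes `NaN`.

```python
import pandas as pd

# Hypothetical data, same shape as the dictionary `d` above
prices = {'a': 10, 'b': 20, 'c': 30}

# Keys become the index, values become the data
s = pd.Series(prices)

# An explicit index selects and reorders by key; 'z' is not a key, so it becomes NaN
s2 = pd.Series(prices, index=['c', 'a', 'z'])
print(s2)
```

Because `NaN` is a floating-point value, `s2` ends up with dtype `float64` even though the dictionary holds integers.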

### Data in a Series

A pandas Series can hold a variety of object types:

In [11]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [12]:
pd.Series(data=[sum, print, len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

In [14]:
# Functions can be stored in a Series (dtype object), which a numeric array or matrix cannot hold

## Using an Index

The key to using a Series is understanding its index. pandas uses these index labels or numbers for fast lookups of information (the index works like a hash table or dictionary).
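
To make the dictionary analogy concrete, here is a minimal sketch using a made-up `population` Series (the name and values are illustrative only). It shows lookup by label, lookup by position, a membership test, and a lookup with a default.

```python
import pandas as pd

# Hypothetical Series for illustration
population = pd.Series([1, 2, 3], index=['USA', 'Germany', 'Japan'])

by_label = population['Germany']         # lookup by label
by_position = population.iloc[1]         # lookup by integer position
has_label = 'Japan' in population        # membership tests the index, not the values
fallback = population.get('France', 0)   # .get returns a default instead of raising KeyError
```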

In [15]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [16]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [17]:
ser1['USA']

1

In [18]:
ser1 + ser2  # operands are aligned by index label; a label missing from either Series yields NaN

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
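
If the `NaN` values are not what you want, one option is `Series.add` with `fill_value`, sketched below with the same two Series. It treats a label that is missing from one side as the fill value instead of producing `NaN`.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Treat a label missing from one Series as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```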

----

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index.
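
A minimal sketch of that idea, using two made-up Series (`w` and `x`) that share the same labels: passing them in a dictionary gives a DataFrame whose columns are the dictionary keys and whose row index is the shared set of labels.

```python
import pandas as pd

# Two hypothetical Series with the same labels
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Each dictionary key becomes a column; the shared labels become the row index
frame = pd.DataFrame({'W': w, 'X': x})

# Pulling a column back out gives a Series again
col = frame['W']
```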

In [19]:
from numpy.random import randn
np.random.seed(101)

In [22]:
df = pd.DataFrame(randn(5,4), index = 'A B C D E'.split(), columns = 'W X Y Z'.split())

# or df = pd.DataFrame(randn(5,4), index = ['A','B','C','D','E'], columns = ['W','X','Y','Z'])

df

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


## Selection and Indexing

Let's look at the various ways to grab data from a DataFrame.

In [23]:
df['W']

A   -0.993263
B    1.025984
C    2.154846
D    0.147027
E   -0.925874
Name: W, dtype: float64

In [24]:
# It looks like a Series because that is exactly what column W is: a Series
type(df['W'])

pandas.core.series.Series

In [25]:
# Pass a list of column names
df[['W','Z']]

,W,Z
A,-0.993263,0.000366
B,1.025984,0.649826
C,2.154846,-0.346419
D,0.147027,1.02481
E,-0.925874,0.610478


Useful for selecting columns that are not adjacent or in order!

**Creating a new column:**

In [32]:
df['new'] = df['W'] + df['Y']

In [33]:
df

,W,X,Y,Z,new
A,-0.993263,0.1968,-1.136645,0.000366,-2.129908
B,1.025984,-0.156598,-0.031579,0.649826,0.994405
C,2.154846,-0.610259,-0.755325,-0.346419,1.399521
D,0.147027,-0.479448,0.558769,1.02481,0.705796
E,-0.925874,1.862864,-1.133817,0.610478,-2.059691


**Removing columns:**

In [34]:
df.drop('new', axis = 1)  # axis defaults to 0 (row labels), so pass axis=1 to drop a column

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [35]:
# The output showed our action but did not save it
df

,W,X,Y,Z,new
A,-0.993263,0.1968,-1.136645,0.000366,-2.129908
B,1.025984,-0.156598,-0.031579,0.649826,0.994405
C,2.154846,-0.610259,-0.755325,-0.346419,1.399521
D,0.147027,-0.479448,0.558769,1.02481,0.705796
E,-0.925874,1.862864,-1.133817,0.610478,-2.059691


In [36]:
df.drop('new', axis = 1, inplace = True)

In [37]:
df

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478
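
As an alternative to `inplace = True`, a common pattern is to keep the returned DataFrame under a name. The sketch below uses a small made-up `frame` and the `columns=` and `index=` keywords of `drop`, which mean the same thing as `axis = 1` and `axis = 0`.

```python
import pandas as pd

# Small hypothetical frame
frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]}, index=['A', 'B'])

# drop() returns a new DataFrame; the original is unchanged unless you reassign
trimmed = frame.drop(columns='new')   # same as frame.drop('new', axis=1)
fewer_rows = frame.drop(index='B')    # same as frame.drop('B', axis=0)
```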


In [38]:
df.drop('E', axis = 0)

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481


In [39]:
df

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [40]:
df.shape

(5, 4)

**Selecting rows:**

In [48]:
# df['X'] ==> the normal way to select a column

# To select rows, use the .loc or .iloc indexer instead

# Select by index label
df.loc['A']

W   -0.993263
X    0.196800
Y   -1.136645
Z    0.000366
Name: A, dtype: float64

In [49]:
# Select by integer position
df.iloc[0:2,:]

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
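
One difference between the two indexers is easy to miss, so here is a short sketch on a made-up `frame`: a label slice with `.loc` includes the end label, while a position slice with `.iloc` excludes the end position, like ordinary Python slicing.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

by_label = frame.loc['A':'B']    # label slice: includes the end label 'B'
by_position = frame.iloc[0:2]    # position slice: excludes position 2
one_row = frame.iloc[0:1]        # only the first row
```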


**Select a subset of rows and columns:**

In [50]:
df.loc['B', 'Y']

-0.031579143908112575

In [51]:
df.loc[['A', 'B'],['W','Y']]

,W,Y
A,-0.993263,-1.136645
B,1.025984,-0.031579


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [52]:
df

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [53]:
df > 0

,W,X,Y,Z
A,False,True,False,True
B,True,False,False,True
C,True,False,False,False
D,True,False,True,True
E,False,True,False,True


In [55]:
booldf = df > 0

In [57]:
df[booldf]

,W,X,Y,Z
A,,0.1968,,0.000366
B,1.025984,,,0.649826
C,2.154846,,,
D,0.147027,,0.558769,1.02481
E,,1.862864,,0.610478


In [58]:
df[df>0]

,W,X,Y,Z
A,,0.1968,,0.000366
B,1.025984,,,0.649826
C,2.154846,,,
D,0.147027,,0.558769,1.02481
E,,1.862864,,0.610478


In [60]:
# More common: build the condition from a single column
df['W']>0

A    False
B     True
C     True
D     True
E    False
Name: W, dtype: bool

In [61]:
df[df['W']>0]

,W,X,Y,Z
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481


In [62]:
# I want to grab all the rows where the value of column Z is less than zero
df[df['Z']<0]

,W,X,Y,Z
C,2.154846,-0.610259,-0.755325,-0.346419


In [65]:
# The result is itself a DataFrame, so the same column selection used earlier applies
df[df['W']>0]['X']

# resultdf = df[df['W']>0]
# resultdf['X']

B   -0.156598
C   -0.610259
D   -0.479448
Name: X, dtype: float64

In [67]:
df[df['W']>0][['X','Y']]

,X,Y
B,-0.156598,-0.031579
C,-0.610259,-0.755325
D,-0.479448,0.558769


In [74]:
boolser = df['W'] > 0   # a Series of boolean results
result = df[boolser]    # keep only the rows where boolser is True (a DataFrame)
my_cols = ['X', 'Y']
result[my_cols]         # same result as above

,X,Y
B,-0.156598,-0.031579
C,-0.610259,-0.755325
D,-0.479448,0.558769


**Multiple conditions:**

For multiple conditions, combine them with & and |, wrapping each condition in parentheses:

& == and
| == or
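
The reason for using `&` and `|` rather than Python's `and` and `or` can be seen in a short sketch on a made-up `frame`: the keyword operators need a single True/False value, and pandas raises a `ValueError` when asked to reduce a whole Series to one.

```python
import pandas as pd

frame = pd.DataFrame({'W': [-1, 1, 2], 'Y': [1, -1, 3]}, index=['A', 'B', 'C'])

# Element-wise operators work on whole Series; each comparison needs its own parentheses
both = frame[(frame['W'] > 0) & (frame['Y'] > 0)]
either = frame[(frame['W'] > 0) | (frame['Y'] > 0)]

# Python's `and` asks for a single True/False for the whole Series and fails
try:
    frame[(frame['W'] > 0) and (frame['Y'] > 0)]
    raised = False
except ValueError:
    raised = True
```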

In [76]:
df[(df['W'] > 0) & (df['Y'] > 0)]

,W,X,Y,Z
D,0.147027,-0.479448,0.558769,1.02481


In [79]:
df[(df['W'] > 0) | (df['Y'] > 1)]

,W,X,Y,Z
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [80]:
df

,W,X,Y,Z
A,-0.993263,0.1968,-1.136645,0.000366
B,1.025984,-0.156598,-0.031579,0.649826
C,2.154846,-0.610259,-0.755325,-0.346419
D,0.147027,-0.479448,0.558769,1.02481
E,-0.925874,1.862864,-1.133817,0.610478


In [84]:
# Reset to the default 0, 1, ..., n-1 index; the old index becomes a column in the DataFrame:
df.reset_index()

# To save the change, pass inplace = True

,index,W,X,Y,Z
0,A,-0.993263,0.1968,-1.136645,0.000366
1,B,1.025984,-0.156598,-0.031579,0.649826
2,C,2.154846,-0.610259,-0.755325,-0.346419
3,D,0.147027,-0.479448,0.558769,1.02481
4,E,-0.925874,1.862864,-1.133817,0.610478


In [86]:
newind = 'CA NY WY OR CO'.split()  # trick to create a list

In [87]:
df['States'] = newind

In [88]:
df

,W,X,Y,Z,States
A,-0.993263,0.1968,-1.136645,0.000366,CA
B,1.025984,-0.156598,-0.031579,0.649826,NY
C,2.154846,-0.610259,-0.755325,-0.346419,WY
D,0.147027,-0.479448,0.558769,1.02481,OR
E,-0.925874,1.862864,-1.133817,0.610478,CO


In [90]:
# If the DataFrame has a column that should become the index
df.set_index('States')

# Need to pass inplace=True to save the change

States,W,X,Y,Z
CA,-0.993263,0.1968,-1.136645,0.000366
NY,1.025984,-0.156598,-0.031579,0.649826
WY,2.154846,-0.610259,-0.755325,-0.346419
OR,0.147027,-0.479448,0.558769,1.02481
CO,-0.925874,1.862864,-1.133817,0.610478
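
The following sketch, on a small made-up `frame`, puts `set_index` and `reset_index` side by side. Both return a new DataFrame by default. Note that `set_index` discards the previous index, while `reset_index` keeps it as a column unless `drop=True` is passed.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3], 'States': ['CA', 'NY', 'WY']}, index=['A', 'B', 'C'])

by_state = frame.set_index('States')        # 'States' becomes the index; the old index is discarded
restored = by_state.reset_index()           # index moves back into a column; new index is 0..n-1
dropped = by_state.reset_index(drop=True)   # discard the index instead of keeping it as a column
```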


## Multi-Index and Index Hierarchy

Let's go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [99]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)  # build a MultiIndex from the list of tuples

In [95]:
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )

In [98]:
list(zip(outside,inside))  # turn into a list of tuple pairs

[('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]
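
`from_tuples` is not the only constructor. As a sketch of two alternatives, `from_arrays` takes the two lists directly (no `zip` needed), and `from_product` builds every combination of the outer and inner labels, which happens to match this example exactly.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]

from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))
from_arrays = pd.MultiIndex.from_arrays([outside, inside])             # one list per level
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])   # every combination
```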

In [100]:
df = pd.DataFrame(randn(6,2), index = hier_index, columns = ['A', 'B'])

In [101]:
df

,,A,B
G1,1,0.38603,2.084019
G1,2,-0.376519,0.230336
G1,3,0.681209,1.035125
G2,1,-0.03116,1.939932
G2,2,-1.005187,-0.74179
G2,3,0.187125,-0.732845


Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation, df[]. Calling one level of the index returns the sub-DataFrame:

In [102]:
df.loc['G1']

,A,B
1,0.38603,2.084019
2,-0.376519,0.230336
3,0.681209,1.035125


In [104]:
df.loc['G1'].loc[1]  # select from the outer level first, then continue into the inner level

A    0.386030
B    2.084019
Name: 1, dtype: float64

In [106]:
df.index.names = ['Groups', 'Num']

In [107]:
df

Groups,Num,A,B
G1,1,0.38603,2.084019
G1,2,-0.376519,0.230336
G1,3,0.681209,1.035125
G2,1,-0.03116,1.939932
G2,2,-1.005187,-0.74179
G2,3,0.187125,-0.732845


In [110]:
df.loc['G2'].loc[2]['B']

-0.7417897046689249
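
The chained lookups can also be collapsed into a single `.loc` call by passing the row key as an `(outer, inner)` tuple. The sketch below uses a made-up `frame` with integer values (not the random data above) so the results are easy to check.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Groups', 'Num'])
frame = pd.DataFrame({'A': range(6), 'B': range(10, 16)}, index=index)

chained = frame.loc['G2'].loc[2]['B']   # three separate lookups
direct = frame.loc[('G2', 2), 'B']      # one lookup: (outer, inner) row key, then column
row = frame.loc[('G1', 1)]              # a full row as a Series
```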

**Cross section:**

df.xs() returns a cross section of rows or columns from a Series or DataFrame; we use it when we have a multi-level index:

In [111]:
df.xs('G1')

Num,A,B
1,0.38603,2.084019
2,-0.376519,0.230336
3,0.681209,1.035125


In [112]:
df.xs(1, level = 'Num')

Groups,A,B
G1,0.38603,2.084019
G2,-0.03116,1.939932
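
By default `xs` drops the level it selected on, which is why only `Groups` remains in the index above. If you want to keep both levels, `xs` accepts `drop_level=False`, as this sketch on a made-up integer-valued `frame` shows.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Groups', 'Num'])
frame = pd.DataFrame({'A': range(6), 'B': range(10, 16)}, index=index)

inner = frame.xs(1, level='Num')                    # the 'Num' level is dropped from the result
kept = frame.xs(1, level='Num', drop_level=False)   # both index levels are kept
```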
