# Operations

The pandas library offers many miscellaneous operations that you will use again and again. This lecture walks through the most common ones.

In [52]:
import numpy as np
import pandas as pd

In [53]:
df = pd.DataFrame(data={
    'col1':[1, 2, 3, 4],
    'col2':[444, 555, 666, 444],
    'col3':['abc', 'def', 'ghi', 'xyz']
})

In [54]:
# show the dataframe
df

   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz


### Finding unique values

In [55]:
# This returns all the unique values in "col2" of the dataframe
df['col2'].unique()

array([444, 555, 666], dtype=int64)

In [56]:
# This is the first way of counting the number of unique values in a column
len(df['col2'].unique())

3

In [57]:
# This is the second way of counting the number of unique values in a column
df['col2'].nunique()

3

In [58]:
# This returns a frequency table: each unique value in "col2" and how many times it occurs
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64
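The three calls above are closely related. The following is a minimal sketch of how they fit together; it rebuilds the same dataframe so that it runs on its own, and the names `uniques` and `counts` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [444, 555, 666, 444],
    'col3': ['abc', 'def', 'ghi', 'xyz'],
})

uniques = df['col2'].unique()        # NumPy array of the distinct values
counts = df['col2'].value_counts()   # Series mapping each distinct value to its count

# nunique() agrees with the length of unique()
assert df['col2'].nunique() == len(uniques)

# value_counts() has one entry per distinct value,
# and its counts add up to the number of rows
assert len(counts) == len(uniques)
assert counts.sum() == len(df)

# Look up how often a single value occurs
print(counts.loc[444])  # 2
```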

### Selecting Data

In [59]:
# show the dataFrame
df

   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz


In [60]:
# Returns all rows of the dataFrame where the value in "col2" is greater than 2 (here, every row qualifies)
df[df['col2']>2]

   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz


In [61]:
# Combine conditions with & (and); each condition must be wrapped in parentheses
df[(df['col2'] > 2) & (df['col2']==444)]

   col1  col2 col3
0     1   444  abc
3     4   444  xyz
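The selections above work because a comparison on a column produces a boolean Series, which is then used to pick rows. The sketch below makes that mask explicit and shows a few other ways of combining conditions (`|`, `~` and `isin()`). The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [444, 555, 666, 444],
    'col3': ['abc', 'def', 'ghi', 'xyz'],
})

# A comparison on a column yields a boolean Series (the "mask")
mask = df['col2'] == 444
print(mask.tolist())  # [True, False, False, True]

# | means "or" and ~ means "not"; each condition needs its own parentheses
either = df[(df['col1'] > 3) | (df['col2'] == 555)]
not_444 = df[~(df['col2'] == 444)]

# isin() is a compact way to test a column against several values
picked = df[df['col2'].isin([555, 666])]

print(either.index.tolist())   # [1, 3]
print(not_444.index.tolist())  # [1, 2]
print(picked.index.tolist())   # [1, 2]
```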


### Applying Functions

The apply() method will be one of the most important and powerful tools in your toolbox when working with pandas.

In [62]:
# Let us say we have a function "times2" which takes in a value and returns its double
def times2(x):
    return x*2

In [63]:
# We already know that we can call built-in pandas methods on part or all of a dataFrame
df['col1'].sum()

10

In [64]:
# The apply() method lets you apply custom functions like "times2" to part or all of a dataFrame
df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

As we can see, the apply() method called "times2" on each entry of the column and returned a new Series holding the results. The original dataframe is left unchanged.
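For simple arithmetic like this, the same result can usually be obtained without apply() by operating on the whole column at once. The sketch below compares the two approaches on a standalone Series with the same values as "col1".

```python
import pandas as pd

def times2(x):
    return x * 2

col1 = pd.Series([1, 2, 3, 4], name='col1')

applied = col1.apply(times2)   # calls times2 once per element
vectorized = col1 * 2          # same arithmetic applied to the whole column at once

# Both approaches produce the same Series
assert applied.equals(vectorized)
print(vectorized.tolist())  # [2, 4, 6, 8]
```

apply() earns its keep when the logic cannot be expressed as a simple column operation.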

In [65]:
# We can also use built-in Python functions like "len"
df['col3'].apply(len)
# In this case we get a new Series holding the length of each string in "col3"

0    3
1    3
2    3
3    3
Name: col3, dtype: int64

The apply() method becomes particularly powerful when we combine it with lambda expressions. That way, you don't need to take the time to define a named function in the right scope if you are only going to apply it once.

In [66]:
df['col2'].apply(lambda x: x*2)

0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
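Lambda expressions are not limited to numeric columns. As a small sketch, the block below applies one lambda to a column of strings and another to whole rows. It rebuilds the original dataframe so it runs on its own; `upper` and `labels` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [444, 555, 666, 444],
    'col3': ['abc', 'def', 'ghi', 'xyz'],
})

# Series.apply: the lambda receives one value at a time
upper = df['col3'].apply(lambda s: s.upper())

# DataFrame.apply with axis=1: the lambda receives one row at a time
labels = df.apply(lambda row: f"{row['col3']}-{row['col1']}", axis=1)

print(upper.tolist())   # ['ABC', 'DEF', 'GHI', 'XYZ']
print(labels.tolist())  # ['abc-1', 'def-2', 'ghi-3', 'xyz-4']
```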

**Removing a column from the dataFrame**

In [67]:
# This is one way of removing a column: drop() returns a new dataframe without it
df.drop('col1', axis=1)

# df itself is not modified unless you pass inplace=True

   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
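Because drop() returns a new dataframe by default, the original still contains the column afterwards. The sketch below checks this, and also shows the `columns=` keyword as an alternative to `axis=1`. The names `trimmed` and `same` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': [444, 555, 666, 444],
    'col3': ['abc', 'def', 'ghi', 'xyz'],
})

trimmed = df.drop('col1', axis=1)   # new dataframe without col1

# The original dataframe still has all three columns
print(list(df.columns))       # ['col1', 'col2', 'col3']
print(list(trimmed.columns))  # ['col2', 'col3']

# The columns= keyword does the same thing as axis=1
same = df.drop(columns=['col1'])
assert same.equals(trimmed)
```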


In [68]:
# Another way of permanently deleting a column from a dataFrame is the del statement
del df['col1']

In [69]:
# show the dataframe to see changes
df

   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz


**Get column names and index**

In [70]:
# This shows the names of all the columns present in the dataframe
df.columns

Index(['col2', 'col3'], dtype='object')

In [71]:
# This shows the index labels of the dataFrame
df.index

RangeIndex(start=0, stop=4, step=1)

**Sorting and ordering a DataFrame**

In [72]:
# show the dataFrame
df

   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz


In [73]:
# Let us sort the rows of the data frame by values present in "col2"
df.sort_values(by='col2') # inplace=False by default

   col2 col3
0   444  abc
3   444  xyz
1   555  def
2   666  ghi


Note how each row keeps its original index label even after the data frame is reordered.
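A few common variations on sorting are sketched below: descending order, sorting by more than one column, and renumbering the rows afterwards with reset_index(). The block rebuilds the two-column dataframe so it runs on its own, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Largest values first
desc = df.sort_values(by='col2', ascending=False)

# Sort by col2, breaking ties with col3
multi = df.sort_values(by=['col2', 'col3'])

# Discard the old labels and renumber the rows from 0
renumbered = multi.reset_index(drop=True)

print(desc['col2'].tolist())      # [666, 555, 444, 444]
print(multi.index.tolist())       # [0, 3, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```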

**Finding or checking for null values**

In [74]:
df.isnull()

    col2   col3
0  False  False
1  False  False
2  False  False
3  False  False


The isnull() method returns a dataframe of booleans indicating whether or not the value in each cell is null. In this case, our data frame has no null or NaN values, so every cell is False.

In [75]:
# drop all rows which have NaN values
df.dropna()

   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz


Since our data frame has no null or NaN values, dropna() returns it with every row intact.

**Filling in the NaN values with something else**

In [76]:
# Rebuild the dataFrame with "col1" restored and one NaN value in each of "col1" and "col2"
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()

   col1   col2 col3
0   1.0    NaN  abc
1   2.0  555.0  def
2   3.0  666.0  ghi
3   NaN  444.0  xyz


In [77]:
# Replace every NaN value with the string 'FILL' (returns a new dataframe)
df.fillna('FILL')

   col1  col2 col3
0     1  FILL  abc
1     2   555  def
2     3   666  ghi
3  FILL   444  xyz
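With missing values actually present, the null-handling methods do something visible. The sketch below counts the NaN values per column, drops rows or columns that contain them, and fills a numeric column with its own mean instead of a placeholder string. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [np.nan, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Number of missing values in each column
null_counts = df.isnull().sum()

# Drop every row that contains at least one NaN (rows 0 and 3 here)
rows_kept = df.dropna()

# Drop every column that contains at least one NaN (col1 and col2 here)
cols_kept = df.dropna(axis=1)

# Fill the gap in col1 with the mean of its non-missing values: (1 + 2 + 3) / 3
filled = df['col1'].fillna(df['col1'].mean())

print(null_counts.tolist())        # [1, 1, 0]
print(rows_kept.index.tolist())    # [1, 2]
print(list(cols_kept.columns))     # ['col3']
print(filled.tolist())             # [1.0, 2.0, 3.0, 2.0]
```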


### Pivot Tables

In [78]:
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

In [79]:
# show the data frame
df

     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1


Note that we have repeating values in columns 'A', 'B' and 'C'.

In [80]:
df.pivot_table(values='D', index=['A', 'B'], columns='C')

C          x    y
A   B
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN


The pivot_table() method takes three main arguments:

1. The values
2. The index
3. The columns

In this case, we indicated that the values we want in our table come from the 'D' column. The index is set to the 'A' and 'B' columns, which means the table has a multi-level index. Lastly, the distinct values of column 'C' ('x' and 'y') become the columns of the table. Combinations that never occur in the data, such as ('bar', 'two', 'x'), appear as NaN.
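When several rows fall into the same cell of the table, pivot_table() has to combine them, and by default it takes their mean. The sketch below pivots on 'A' alone so that this aggregation becomes visible, then switches to a sum and fills the empty cells of the original table with 0. It rebuilds the same data so it runs on its own; `by_mean`, `by_sum` and `filled` are illustrative names.

```python
import pandas as pd

data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['x', 'y', 'x', 'y', 'x', 'y'],
        'D': [1, 3, 2, 5, 4, 1]}
df = pd.DataFrame(data)

# Two rows have A='foo' and C='x' (D=1 and D=2); the default aggregation is the mean
by_mean = df.pivot_table(values='D', index='A', columns='C')
print(by_mean.loc['foo', 'x'])  # 1.5

# aggfunc changes how rows sharing a cell are combined
by_sum = df.pivot_table(values='D', index='A', columns='C', aggfunc='sum')
print(by_sum.loc['bar', 'y'])   # 6

# fill_value replaces the NaN of combinations that never occur
filled = df.pivot_table(values='D', index=['A', 'B'], columns='C', fill_value=0)
print(filled.loc[('bar', 'two'), 'x'])  # 0
```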