# <a id='contents'>Contents</a>

* [Importing](#import)
* [Working with Categorical Data](#category)
* [Working with Conditions](#condition)

# <a id='import'>Importing</a>

In [2]:
import pandas as pd

# <a id='category'>Working with Categorical Data</a>

Preparing the data.

In [3]:
df = pd.read_csv('data/Iris.csv')

In [92]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [93]:
df.shape

(150, 6)

Of course, not all columns are expected to be numerical. For instance, the Species column is categorical. Let's inspect it, then list its distinct values with the unique method.

In [94]:
df['Species']

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

In [95]:
df['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

We have 3 unique categories in that column. We can also count the number of instances in each category with the value_counts() method.

In [96]:
df['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
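
The same method can also report proportions instead of raw counts. The sketch below uses a small hypothetical six-value Series (not the Iris file itself) to show `value_counts(normalize=True)` next to `nunique()`.

```python
import pandas as pd

# Hypothetical six-value sample standing in for the Species column
species = pd.Series(
    ['Iris-setosa', 'Iris-setosa', 'Iris-setosa',
     'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica'],
    name='Species',
)

# Absolute counts per category, sorted by frequency (descending)
counts = species.value_counts()

# Relative frequencies instead of counts
shares = species.value_counts(normalize=True)

# Number of distinct categories, without listing them
n_categories = species.nunique()

print(counts)
print(shares)
print(n_categories)
```

Relative frequencies are often easier to compare than counts when the categories are unbalanced.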

# <a id='condition'>Working with Conditions</a>

Boolean indexing in pandas uses nearly the same syntax as in NumPy. Let's first take the array version of the DataFrame and select the rows that satisfy a condition.

In [97]:
arr = df.values

In [98]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')
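
Because the DataFrame mixes numbers and strings, its array version has dtype object. A minimal sketch on a hypothetical two-row frame, using `to_numpy()` (the method form of `.values`), shows the difference from converting numeric columns only.

```python
import pandas as pd

# Hypothetical two-row sample with one numeric and one string column
sample = pd.DataFrame({
    'SepalLengthCm': [5.1, 7.0],
    'Species': ['Iris-setosa', 'Iris-versicolor'],
})

# Mixed column types force a generic object array
mixed = sample.to_numpy()

# Selecting only numeric columns first keeps a float array
numeric = sample[['SepalLengthCm']].to_numpy()

print(mixed.dtype, numeric.dtype, numeric.shape)
```

Object arrays still support comparisons such as `arr[:, 1] > 6.7`, but numeric arrays are generally faster to work with.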

In [138]:
# arr[:, 1] selects the second column (index 1), SepalLengthCm
condition_1 = arr[:, 1] > 6.7

In [100]:
arr[condition_1]

array([[51, 7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'],
       [53, 6.9, 3.1, 4.9, 1.5, 'Iris-versicolor'],
       [77, 6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'],
       [103, 7.1, 3.0, 5.9, 2.1, 'Iris-virginica'],
       [106, 7.6, 3.0, 6.6, 2.1, 'Iris-virginica'],
       [108, 7.3, 2.9, 6.3, 1.8, 'Iris-virginica'],
       [110, 7.2, 3.6, 6.1, 2.5, 'Iris-virginica'],
       [113, 6.8, 3.0, 5.5, 2.1, 'Iris-virginica'],
       [118, 7.7, 3.8, 6.7, 2.2, 'Iris-virginica'],
       [119, 7.7, 2.6, 6.9, 2.3, 'Iris-virginica'],
       [121, 6.9, 3.2, 5.7, 2.3, 'Iris-virginica'],
       [123, 7.7, 2.8, 6.7, 2.0, 'Iris-virginica'],
       [126, 7.2, 3.2, 6.0, 1.8, 'Iris-virginica'],
       [130, 7.2, 3.0, 5.8, 1.6, 'Iris-virginica'],
       [131, 7.4, 2.8, 6.1, 1.9, 'Iris-virginica'],
       [132, 7.9, 3.8, 6.4, 2.0, 'Iris-virginica'],
       [136, 7.7, 3.0, 6.1, 2.3, 'Iris-virginica'],
       [140, 6.9, 3.1, 5.4, 2.1, 'Iris-virginica'],
       [142, 6.9, 3.1, 5.1, 2.3, 'Iris-virginica'],
       [144,

So, how do we apply the same logic in pandas? See the example below.

In [101]:
condition = df['SepalLengthCm'] > 6.7

In [102]:
# A Series of boolean values, one per row
condition

0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Name: SepalLengthCm, Length: 150, dtype: bool
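
A boolean Series is useful on its own, before any rows are selected. The sketch below, on a hypothetical five-row sample, counts the matching rows and lists their index labels.

```python
import pandas as pd

# Hypothetical five-row sample with the same column name as the Iris data
sample = pd.DataFrame({'SepalLengthCm': [5.1, 7.0, 6.9, 4.7, 6.8]})

mask = sample['SepalLengthCm'] > 6.7

n_matches = int(mask.sum())               # True counts as 1, so this counts matching rows
share = float(mask.mean())                # fraction of rows that satisfy the condition
matching_labels = list(mask[mask].index)  # index labels where the mask is True

print(n_matches, share, matching_labels)
```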

In [103]:
df.loc[condition]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
50,51,7.0,3.2,4.7,1.4,Iris-versicolor
52,53,6.9,3.1,4.9,1.5,Iris-versicolor
76,77,6.8,2.8,4.8,1.4,Iris-versicolor
102,103,7.1,3.0,5.9,2.1,Iris-virginica
105,106,7.6,3.0,6.6,2.1,Iris-virginica
107,108,7.3,2.9,6.3,1.8,Iris-virginica
109,110,7.2,3.6,6.1,2.5,Iris-virginica
112,113,6.8,3.0,5.5,2.1,Iris-virginica
117,118,7.7,3.8,6.7,2.2,Iris-virginica
118,119,7.7,2.6,6.9,2.3,Iris-virginica


There we have it! The same rows, this time returned as a DataFrame instead of an array. Now, let's apply multiple conditions.

In [104]:
condition_1 = arr[:, 1] > 5.5
condition_2 = arr[:, 2] < 3

arr[(condition_1 & condition_2)]

array([[55, 6.5, 2.8, 4.6, 1.5, 'Iris-versicolor'],
       [56, 5.7, 2.8, 4.5, 1.3, 'Iris-versicolor'],
       [59, 6.6, 2.9, 4.6, 1.3, 'Iris-versicolor'],
       [63, 6.0, 2.2, 4.0, 1.0, 'Iris-versicolor'],
       [64, 6.1, 2.9, 4.7, 1.4, 'Iris-versicolor'],
       [65, 5.6, 2.9, 3.6, 1.3, 'Iris-versicolor'],
       [68, 5.8, 2.7, 4.1, 1.0, 'Iris-versicolor'],
       [69, 6.2, 2.2, 4.5, 1.5, 'Iris-versicolor'],
       [70, 5.6, 2.5, 3.9, 1.1, 'Iris-versicolor'],
       [72, 6.1, 2.8, 4.0, 1.3, 'Iris-versicolor'],
       [73, 6.3, 2.5, 4.9, 1.5, 'Iris-versicolor'],
       [74, 6.1, 2.8, 4.7, 1.2, 'Iris-versicolor'],
       [75, 6.4, 2.9, 4.3, 1.3, 'Iris-versicolor'],
       [77, 6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'],
       [79, 6.0, 2.9, 4.5, 1.5, 'Iris-versicolor'],
       [80, 5.7, 2.6, 3.5, 1.0, 'Iris-versicolor'],
       [83, 5.8, 2.7, 3.9, 1.2, 'Iris-versicolor'],
       [84, 6.0, 2.7, 5.1, 1.6, 'Iris-versicolor'],
       [88, 6.3, 2.3, 4.4, 1.3, 'Iris-versicolor'],
       [93, 

In [105]:
condition_1 = df['SepalLengthCm'] > 5.5
condition_2 = df['SepalWidthCm'] < 3

In [106]:
df.loc[condition_1 & condition_2]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
54,55,6.5,2.8,4.6,1.5,Iris-versicolor
55,56,5.7,2.8,4.5,1.3,Iris-versicolor
58,59,6.6,2.9,4.6,1.3,Iris-versicolor
62,63,6.0,2.2,4.0,1.0,Iris-versicolor
63,64,6.1,2.9,4.7,1.4,Iris-versicolor
64,65,5.6,2.9,3.6,1.3,Iris-versicolor
67,68,5.8,2.7,4.1,1.0,Iris-versicolor
68,69,6.2,2.2,4.5,1.5,Iris-versicolor
69,70,5.6,2.5,3.9,1.1,Iris-versicolor
71,72,6.1,2.8,4.0,1.3,Iris-versicolor


We can also store the filtered rows as a sub-DataFrame and work on it separately.

In [107]:
subdf = df.loc[condition_1 & condition_2]

In [108]:
subdf['Species'].value_counts()

Species
Iris-versicolor    24
Iris-virginica     20
Name: count, dtype: int64

Other operators work too: ~ negates a condition and | combines conditions with OR.

In [109]:
df.loc[~condition_1 | condition_2]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
132,133,6.4,2.8,5.6,2.2,Iris-virginica
133,134,6.3,2.8,5.1,1.5,Iris-virginica
134,135,6.1,2.6,5.6,1.4,Iris-virginica
142,143,5.8,2.7,5.1,1.9,Iris-virginica
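
To make the operator logic concrete, here is a minimal sketch on a hypothetical five-row sample. The names `long_sepal` and `narrow_sepal` are illustrative stand-ins for `condition_1` and `condition_2`.

```python
import pandas as pd

# Hypothetical sample; column names follow the Iris data used above
sample = pd.DataFrame({
    'SepalLengthCm': [5.1, 6.5, 5.7, 6.0, 5.4],
    'SepalWidthCm':  [3.5, 2.8, 2.8, 3.4, 2.9],
})
long_sepal = sample['SepalLengthCm'] > 5.5
narrow_sepal = sample['SepalWidthCm'] < 3

either = ~long_sepal | narrow_sepal      # NOT long, OR narrow
negated = ~(long_sepal & ~narrow_sepal)  # the same mask by De Morgan's law
exactly_one = long_sepal ^ narrow_sepal  # XOR: exactly one condition holds

print(either.tolist())
print(either.equals(negated))
print(exactly_one.tolist())
```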


Selecting specific columns from the filtered sub-DataFrame.

The following returns a Series, since a single column label selects one-dimensional data.

In [110]:
df.loc[~condition_1 | condition_2, 'Species'].head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object

When we put the column name in a list, the result is a new DataFrame.

In [111]:
df.loc[~condition_1 | condition_2, ['Species']].head()

Unnamed: 0,Species
0,Iris-setosa
1,Iris-setosa
2,Iris-setosa
3,Iris-setosa
4,Iris-setosa


You can also select multiple columns.

In [112]:
df.loc[~condition_1 | condition_2, ['PetalLengthCm', 'Species']].head()

Unnamed: 0,PetalLengthCm,Species
0,1.4,Iris-setosa
1,1.4,Iris-setosa
2,1.3,Iris-setosa
3,1.5,Iris-setosa
4,1.4,Iris-setosa
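
The difference between the two forms shows up in the type and shape of the result. A minimal sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical three-row sample; column names follow the Iris data
sample = pd.DataFrame({
    'PetalLengthCm': [1.4, 4.7, 5.9],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})
mask = sample['PetalLengthCm'] > 2

as_series = sample.loc[mask, 'Species']    # single label -> Series (1D)
as_frame = sample.loc[mask, ['Species']]   # list of labels -> DataFrame (2D)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```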


The NumPy equivalent selects the columns by position (3 is PetalLengthCm, 5 is Species).

In [113]:
arr[~condition_1 | condition_2][:, [3, 5]]

array([[1.4, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.3, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.7, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.6, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.1, 'Iris-setosa'],
       [1.3, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.7, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.0, 'Iris-setosa'],
       [1.7, 'Iris-setosa'],
       [1.9, 'Iris-setosa'],
       [1.6, 'Iris-setosa'],
       [1.6, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.6, 'Iris-setosa'],
       [1.6, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.4, 'Iris-setosa'],
       [1.5, 'Iris-setosa'],
       [1.2, 'Iris-setosa'],
       [1.3, 'Iris-setosa'],
       [1.5, '

To select by integer position rather than by label, use iloc.

In [114]:
df.iloc[:, [0, 1, 2]]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm
0,1,5.1,3.5
1,2,4.9,3.0
2,3,4.7,3.2
3,4,4.6,3.1
4,5,5.0,3.6
...,...,...,...
145,146,6.7,3.0
146,147,6.3,2.5
147,148,6.5,3.0
148,149,6.2,3.4


In [115]:
arr[:, [0, 1, 2]]

array([[1, 5.1, 3.5],
       [2, 4.9, 3.0],
       [3, 4.7, 3.2],
       [4, 4.6, 3.1],
       [5, 5.0, 3.6],
       [6, 5.4, 3.9],
       [7, 4.6, 3.4],
       [8, 5.0, 3.4],
       [9, 4.4, 2.9],
       [10, 4.9, 3.1],
       [11, 5.4, 3.7],
       [12, 4.8, 3.4],
       [13, 4.8, 3.0],
       [14, 4.3, 3.0],
       [15, 5.8, 4.0],
       [16, 5.7, 4.4],
       [17, 5.4, 3.9],
       [18, 5.1, 3.5],
       [19, 5.7, 3.8],
       [20, 5.1, 3.8],
       [21, 5.4, 3.4],
       [22, 5.1, 3.7],
       [23, 4.6, 3.6],
       [24, 5.1, 3.3],
       [25, 4.8, 3.4],
       [26, 5.0, 3.0],
       [27, 5.0, 3.4],
       [28, 5.2, 3.5],
       [29, 5.2, 3.4],
       [30, 4.7, 3.2],
       [31, 4.8, 3.1],
       [32, 5.4, 3.4],
       [33, 5.2, 4.1],
       [34, 5.5, 4.2],
       [35, 4.9, 3.1],
       [36, 5.0, 3.2],
       [37, 5.5, 3.5],
       [38, 4.9, 3.1],
       [39, 4.4, 3.0],
       [40, 5.1, 3.4],
       [41, 5.0, 3.5],
       [42, 4.5, 2.3],
       [43, 4.4, 3.2],
       [44, 5.0, 3.5

In [116]:
df.iloc[:100, 3]

0     1.4
1     1.4
2     1.3
3     1.5
4     1.4
     ... 
95    4.2
96    4.2
97    4.3
98    3.0
99    4.1
Name: PetalLengthCm, Length: 100, dtype: float64

In [117]:
arr[:100, 3]

array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4,
       1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.9, 1.6,
       1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 1.5, 1.4, 1.5, 1.2, 1.3, 1.5, 1.3,
       1.5, 1.3, 1.3, 1.3, 1.6, 1.9, 1.4, 1.6, 1.4, 1.5, 1.4, 4.7, 4.5,
       4.9, 4.0, 4.6, 4.5, 4.7, 3.3, 4.6, 3.9, 3.5, 4.2, 4.0, 4.7, 3.6,
       4.4, 4.5, 4.1, 4.5, 3.9, 4.8, 4.0, 4.9, 4.7, 4.3, 4.4, 4.8, 5.0,
       4.5, 3.5, 3.8, 3.7, 3.9, 5.1, 4.5, 4.5, 4.7, 4.4, 4.1, 4.0, 4.4,
       4.6, 4.0, 3.3, 4.2, 4.2, 4.2, 4.3, 3.0, 4.1], dtype=object)

In [118]:
df.iloc[[1], :]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,2,4.9,3.0,1.4,0.2,Iris-setosa


In [119]:
arr[[1], :]

array([[2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa']], dtype=object)
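
In the Iris DataFrame the index labels happen to equal the row positions, so loc and iloc look interchangeable. The sketch below uses a hypothetical frame with labels 10, 20, 30 to show where they differ.

```python
import pandas as pd

# Hypothetical sample whose index labels deliberately differ from positions
sample = pd.DataFrame(
    {'SepalLengthCm': [5.1, 7.0, 6.3]},
    index=[10, 20, 30],
)

by_position = sample.iloc[0, 0]             # first row, first column
by_label = sample.loc[10, 'SepalLengthCm']  # row labelled 10, named column

# Slices differ too: iloc excludes the end position, loc includes the end label
n_iloc = len(sample.iloc[0:1])
n_loc = len(sample.loc[10:20])

print(by_position, by_label, n_iloc, n_loc)
```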

# Other Manipulation Techniques

Imagine you want to add another column that contains a numerical representation of the categorical column Species. For instance, instead of Iris-setosa, Iris-versicolor, and Iris-virginica, let's use 0, 1, and 2 respectively.

In [120]:
df['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [121]:
df['Species']

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Species, Length: 150, dtype: object

We create another column called Species_num (numerical species). Initially, it holds exactly the same values as the original column.

In [122]:
df['Species_num'] = df['Species'].copy()

In [123]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Species_num
0,1,5.1,3.5,1.4,0.2,Iris-setosa,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa,Iris-setosa


In [124]:
(df['Species'] == df['Species_num']).all()

True

We create a dictionary that specifies the string-to-number conversion, and then apply it to the column.

In [125]:
converter = {'Iris-setosa': 0, 
             'Iris-versicolor': 1,
             'Iris-virginica': 2
            }

In [126]:
converter

{'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

In [127]:
df['Species_num'] = df['Species_num'].map(converter)

In [128]:
df['Species_num']

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: Species_num, Length: 150, dtype: int64
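
One thing to keep in mind with map: values missing from the dictionary become NaN. The sketch below shows this with a hypothetical extra label, 'Iris-unknown', which is not part of the Iris data, and also shows `pd.factorize` as an alternative that assigns codes automatically.

```python
import pandas as pd

# 'Iris-unknown' is a made-up label, added only to show the unmapped case
species = pd.Series(
    ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-unknown']
)
converter = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# Keys absent from the dictionary are mapped to NaN
mapped = species.map(converter)
n_unmapped = int(mapped.isna().sum())

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(species)

print(mapped)
print(n_unmapped)
print(codes, list(uniques))
```

With factorize the codes follow the order of appearance, so they will not necessarily match a hand-written dictionary.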

Let's check that the values were converted correctly.

In [129]:
condition = (df['Species'] == 'Iris-virginica')

In [130]:
# All of them are 2!
df['Species_num'].loc[condition]

100    2
101    2
102    2
103    2
104    2
105    2
106    2
107    2
108    2
109    2
110    2
111    2
112    2
113    2
114    2
115    2
116    2
117    2
118    2
119    2
120    2
121    2
122    2
123    2
124    2
125    2
126    2
127    2
128    2
129    2
130    2
131    2
132    2
133    2
134    2
135    2
136    2
137    2
138    2
139    2
140    2
141    2
142    2
143    2
144    2
145    2
146    2
147    2
148    2
149    2
Name: Species_num, dtype: int64

In [131]:
# Another way of checking:
# is there at least one element that is not equal to 2?
(df['Species_num'].loc[condition] != 2).any()

False
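
The same check can be run for all categories at once. A minimal sketch, assuming a hypothetical four-row sample, counts how many distinct codes each label received.

```python
import pandas as pd

# Hypothetical four-row sample; the converter matches the one defined above
sample = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})
converter = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
sample['Species_num'] = sample['Species'].map(converter)

# One row per (label, code) pair that actually occurs
pairs = sample[['Species', 'Species_num']].drop_duplicates()

# Each label should correspond to exactly one code
codes_per_label = sample.groupby('Species')['Species_num'].nunique()
is_consistent = bool((codes_per_label == 1).all())

print(pairs)
print(is_consistent)
```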

There will be cases where you want to understand patterns in the data with respect to each category. For instance, what is the average sepal length in each species?

In [133]:
# group object
go = df.groupby('Species')

We initialized the groupby object to group the data by the Species column, so that further operations are carried out per category. Let's apply one.

In [134]:
go.mean()

Species,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_num
Iris-setosa,25.5,5.006,3.418,1.464,0.244,0.0
Iris-versicolor,75.5,5.936,2.77,4.26,1.326,1.0
Iris-virginica,125.5,6.588,2.974,5.552,2.026,2.0


Here we calculated the mean of all the numerical columns for each category in the Species column. But we are interested in the SepalLengthCm column only.

A column is selected from the groupby object the same way as from a regular DataFrame.

In [136]:
go['SepalLengthCm']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002295BB7FF10>

Calculating the mean.

In [137]:
go['SepalLengthCm'].mean()

Species
Iris-setosa        5.006
Iris-versicolor    5.936
Iris-virginica     6.588
Name: SepalLengthCm, dtype: float64
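
Several statistics can be computed in one pass with agg. The sketch below uses a hypothetical four-row sample rather than the full Iris data.

```python
import pandas as pd

# Hypothetical four-row sample; column names follow the Iris data
sample = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
    'SepalLengthCm': [5.0, 5.2, 6.4, 6.8],
})

# One row per species, one column per requested statistic
summary = sample.groupby('Species')['SepalLengthCm'].agg(['mean', 'min', 'max', 'count'])

print(summary)
```

The result is a DataFrame indexed by Species, so a single value can be read with, for example, `summary.loc['Iris-setosa', 'mean']`.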