Here I am working through the Iris Dataset in order to get a foundation of knowledge for future coding projects.

In [3]:
#import all needed packages
import pandas as pd
import numpy as np
import seaborn as sns #data visualization library
import matplotlib.pyplot as plt
%matplotlib inline
# read iris data csv
iris_data = pd.read_csv("IRIS.csv")
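
As a quick sanity check on what read_csv produces, here is a minimal sketch that parses a few rows from an in-memory string instead of a file. The variable names csv_text and demo are made up for illustration, and the layout (one header line, comma-separated values) is an assumption about IRIS.csv based on the output shown in this notebook.

```python
import io
import pandas as pd

# A few rows typed in by hand as a stand-in for IRIS.csv
# (assumed layout: one header line, then comma-separated values).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
6.7,3.0,5.2,2.3,Iris-virginica
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)   # (rows, columns)
print(demo.dtypes)  # the four measurement columns are parsed as float64
```

Checking dtypes right after loading is a cheap way to catch a column that was read as text by mistake.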

Now that we have imported all necessary libraries and our data, let's see what our data looks like.
For the next couple of cells we are going to demonstrate some general pandas commands in order to get a feel for the library.

In [12]:
#display full data table
print(iris_data)

     sepal_length  sepal_width  petal_length  petal_width         species
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


Now, let's use the head() function to see the top 5 rows of our data.
Note: without an argument, head() shows the top 5 rows by default.

In [5]:
iris_data.head()

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
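
To make the default explicit, here is a small sketch of head() with and without an argument, plus its counterpart tail(). The demo frame is a hypothetical stand-in for iris_data so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for iris_data: one numeric column, ten rows
demo = pd.DataFrame(
    {"petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]}
)

first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows
default_head = demo.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```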


Now, let's use the sample() function to display random rows of our data. For our example let's display 20 random rows.
Note: the number of rows displayed is set by the argument we pass.

In [6]:
iris_data.sample(20)

     sepal_length  sepal_width  petal_length  petal_width          species
6             4.6          3.4           1.4          0.3      Iris-setosa
22            4.6          3.6           1.0          0.2      Iris-setosa
136           6.3          3.4           5.6          2.4   Iris-virginica
146           6.3          2.5           5.0          1.9   Iris-virginica
49            5.0          3.3           1.4          0.2      Iris-setosa
70            5.9          3.2           4.8          1.8  Iris-versicolor
15            5.7          4.4           1.5          0.4      Iris-setosa
135           7.7          3.0           6.1          2.3   Iris-virginica
134           6.1          2.6           5.6          1.4   Iris-virginica
37            4.9          3.1           1.5          0.1      Iris-setosa
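
Because sample() picks different rows on every run, it can help to fix the seed while experimenting. The sketch below uses the random_state and frac parameters of sample(); the demo frame and its row_id column are hypothetical stand-ins for iris_data.

```python
import pandas as pd

# Hypothetical stand-in with the same number of rows as iris_data
demo = pd.DataFrame({"row_id": range(150)})

# Same seed, same rows: useful when a result needs to be reproducible
a = demo.sample(20, random_state=0)
b = demo.sample(20, random_state=0)

# frac asks for a fraction of the rows instead of a fixed count
tenth = demo.sample(frac=0.1)

print(a.equals(b))
print(len(tenth))
```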


Now let's display the names of the columns.
Note: columns is an attribute, not a function (so no parentheses). It returns the column names as an Index, and its length is the number of columns.


In [7]:
iris_data.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

We are also going to display the shape of our data set.
Note: shape is a (rows, columns) tuple: the total number of rows (entries) and the total number of columns (features) in our dataset.

In [8]:
iris_data.shape

(150, 5)
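
Since shape is a plain tuple and columns is an Index, both can be unpacked or converted for later use. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical two-column stand-in for iris_data
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "species": ["Iris-setosa"] * 3,
})

n_rows, n_cols = demo.shape          # unpack the (rows, columns) tuple
column_names = list(demo.columns)    # Index -> plain Python list

print(n_rows, n_cols)
print(column_names)
print(len(demo.columns))             # number of columns
```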

Let's now demonstrate some row slicing.
Slicing is used when we want to print or work with a specific group of rows.
Syntax: data[start:end], where start is inclusive and end is exclusive.

In [9]:
#for this example let's start at row 5 and end at row 20
print(iris_data[5:21]) #since we want to include row 20 we end at 21, because row 21 will not be printed

#if we wanted to store our sliced portion of data, we can!
sliced = iris_data[10:21]

    sepal_length  sepal_width  petal_length  petal_width      species
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7    
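
To check the inclusive/exclusive rule, the sketch below counts the rows of a slice and compares it with the equivalent iloc and loc calls. The demo frame is a hypothetical stand-in with a default 0..149 index, like iris_data.

```python
import pandas as pd

# Hypothetical stand-in with a default integer index 0..149
demo = pd.DataFrame({"value": range(150)})

rows_5_to_20 = demo[5:21]
print(len(rows_5_to_20))                               # 21 - 5 rows
print(rows_5_to_20.index[0], rows_5_to_20.index[-1])   # first and last label

# Plain slicing is positional, so iloc gives the same rows
same_by_position = demo.iloc[5:21]

# loc slices by label and includes the end label
same_by_label = demo.loc[5:20]
```

Note the difference in the last line: with loc the end label is included, so 5:20 there covers the same rows as 5:21 does with plain slicing or iloc.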

Now let's demonstrate how to display certain columns only.
In any data set we will sometimes need to work on specific columns (features) only.
Syntax: data[["column1_name", "column2_name", "column3_name"]]

In [10]:
#in our example we are going to store our columns in spec_petalwidth

spec_petalwidth = iris_data[["species","petal_width"]]
#for fun we are going to combine this with sample()
print(spec_petalwidth.sample(15)) #sample of 15

             species  petal_width
135   Iris-virginica          2.3
64   Iris-versicolor          1.3
35       Iris-setosa          0.2
12       Iris-setosa          0.1
75   Iris-versicolor          1.4
117   Iris-virginica          2.2
116   Iris-virginica          1.8
65   Iris-versicolor          1.4
56   Iris-versicolor          1.6
145   Iris-virginica          2.3
83   Iris-versicolor          1.6
126   Iris-virginica          1.8
51   Iris-versicolor          1.5
54   Iris-versicolor          1.5
62   Iris-versicolor          1.0
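
One detail worth seeing in code is the difference between single and double brackets: a single column name returns a Series, while a list of names returns a DataFrame. A short sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical three-row stand-in for iris_data
demo = pd.DataFrame({
    "petal_width": [0.2, 1.3, 2.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

as_series = demo["petal_width"]               # single brackets -> Series
as_frame = demo[["petal_width"]]              # list of names -> DataFrame
two_cols = demo[["species", "petal_width"]]   # columns come back in the order listed

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(list(two_cols.columns))
```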


Now let's demonstrate some filtering. We are going to display specific rows using the "loc" and "iloc" indexers (they take square brackets, not parentheses).
"loc" selects rows by index label. It also accepts a boolean condition, which keeps the rows where the condition is True.
"iloc" selects rows by integer position. For a single position it returns that row's values for every column.

In [13]:
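#first we are going to demonstrate iloc (the row at position 17)
iris_data.iloc[17] #not displayed: a notebook cell only shows the result of its last expression

#then we are going to demonstrate the use of loc with a condition on the species column
iris_data.loc[iris_data["species"] == "Iris-virginica"]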

     sepal_length  sepal_width  petal_length  petal_width         species
100           6.3          3.3           6.0          2.5  Iris-virginica
101           5.8          2.7           5.1          1.9  Iris-virginica
102           7.1          3.0           5.9          2.1  Iris-virginica
103           6.3          2.9           5.6          1.8  Iris-virginica
104           6.5          3.0           5.8          2.2  Iris-virginica
105           7.6          3.0           6.6          2.1  Iris-virginica
106           4.9          2.5           4.5          1.7  Iris-virginica
107           7.3          2.9           6.3          1.8  Iris-virginica
108           6.7          2.5           5.8          1.8  Iris-virginica
109           7.2          3.6           6.1          2.5  Iris-virginica
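
With the default 0, 1, 2, ... index, labels and positions happen to coincide, which hides the difference between loc and iloc. The sketch below uses made-up string labels ("a" to "d") so the two are easy to tell apart, and also combines two conditions in one loc call. All names and values here are illustrative.

```python
import pandas as pd

# Hypothetical frame with string labels so that label != position
demo = pd.DataFrame(
    {
        "petal_width": [0.2, 1.3, 2.3, 1.9],
        "species": ["Iris-setosa", "Iris-versicolor",
                    "Iris-virginica", "Iris-virginica"],
    },
    index=["a", "b", "c", "d"],
)

by_position = demo.iloc[2]   # third row, counted from 0
by_label = demo.loc["c"]     # row whose index label is "c"

# Conditions are combined with & (and) or | (or); each needs its own parentheses
mask = (demo["species"] == "Iris-virginica") & (demo["petal_width"] > 2.0)
wide_virginica = demo.loc[mask, ["petal_width"]]   # rows by mask, columns by name

print(by_position.equals(by_label))
print(wide_virginica)
```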


Let's now demonstrate the use of the value_counts() function. This function counts how many times each unique value occurs in a column.
For the example below, we use the species column (as in the example/guides I worked through below). It will count the number of times each species occurs.

In [14]:
iris_data["species"].value_counts()

species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
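
value_counts() can also report proportions instead of raw counts through its normalize parameter. A minimal sketch on a hypothetical four-value Series:

```python
import pandas as pd

# Hypothetical, deliberately unbalanced stand-in for the species column
species = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    name="species",
)

counts = species.value_counts()                 # occurrences per unique value
shares = species.value_counts(normalize=True)   # same, as fractions of the total

print(counts)
print(shares)
print(species.nunique())                        # how many distinct values there are
```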

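Let's play with calculating the sum, mean, and mode of a certain column. For my own personal interest I want to work with the Petal Width variable.
Note: mode() returns a Series rather than a single number, because a column can have more than one mode. That is why the Mode line in the output carries an index and a dtype.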

In [17]:
sum_data = iris_data["petal_width"].sum()
mean_data = iris_data["petal_width"].mean()
mode_data = iris_data["petal_width"].mode()
#for fun let's also do the median (this is as in the tutorial)
median_data = iris_data["petal_width"].median()
print(f"Sum: {sum_data}, \nMean: {mean_data}, \nMedian: {median_data}, \nMode: {mode_data}")

Sum: 179.8, 
Mean: 1.1986666666666668, 
Median: 1.3, 
Mode: 0    0.2
Name: petal_width, dtype: float64
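
To get a plain number out of mode(), one option is to take the first entry of the returned Series. The sketch below also shows a tie and the describe() summary; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the petal_width column
petal_width = pd.Series([0.2, 0.2, 0.2, 1.3, 1.8, 2.3], name="petal_width")

modes = petal_width.mode()    # a Series, because ties are possible
first_mode = modes.iloc[0]    # plain number, handy for f-strings

# With a tie, every tied value is returned
tied = pd.Series([1, 1, 2, 2, 3]).mode()

print(first_mode)
print(list(tied))
print(petal_width.describe())  # count, mean, std, min, quartiles, max in one call
```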


Now here is the use of the min() and max() functions.

In [18]:
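min_width = iris_data["petal_width"].min() #named min_width/max_width so Python's built-in min() and max() are not overwritten
max_width = iris_data["petal_width"].max()
print(f"Petal Width min: {min_width} \nPetal Width max: {max_width}")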

Petal Width min: 0.1 
Petal Width max: 2.5
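
A side note on naming: assigning a result to a variable called min or max replaces Python's built-in functions of the same name for the rest of the session, so distinct names are safer. The sketch below uses the hypothetical names min_width and max_width, and also shows agg() and idxmax() as related calls; the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the petal_width column
petal_width = pd.Series([0.2, 0.1, 1.3, 2.5, 1.8], name="petal_width")

min_width = petal_width.min()
max_width = petal_width.max()

# agg computes several statistics in one call and returns them as a Series
summary = petal_width.agg(["min", "max"])

# idxmax returns the index label of the largest value
where_max = petal_width.idxmax()

print(summary)
print(where_max)
print(max(3, 7))   # the built-in max() is still usable
```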


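Moving on to something slightly more complex. We're going to follow our tutorial and work through an example of using the DataFrame.style property. Following this example, we are going to highlight the maximum values in each column, in each row, and in the whole table.
We do this with the Styler's highlight_max method (highlight_min works the same way for minimums); style.apply is the more general tool for custom styling functions.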

In [24]:

iris_data.style

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa


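Note: highlight_max compares values with each other, so it needs columns that can be compared. With the string species column included, rendering the axis=None version fails with:
UFuncTypeError: ufunc 'greater_equal' did not contain a loop with signature matching types (<class 'numpy.dtypes.Float64DType'>, <class 'numpy.dtypes.StrDType'>) -> None
Floats cannot be compared with strings, so the cell below passes subset= to limit the highlighting to the four numeric columns.

In [25]:
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
#first let's do columns, which means axis=0
iris_data.style.highlight_max(color="green", axis=0, subset=numeric_cols)
#next let's do rows, axis=1
iris_data.style.highlight_max(color="green", axis=1, subset=numeric_cols)
#next let's do our entire table, meaning axis=None (only this last Styler is rendered by the cell)
iris_data.style.highlight_max(color="green", axis=None, subset=numeric_cols)
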

<pandas.io.formats.style.Styler at 0x2316443f310>
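
To see which values each axis setting would pick out, the same maxima can be computed directly on the numeric columns, without the Styler. This is only a sketch of the idea on a hypothetical three-row frame; select_dtypes is used here to drop the string column before comparing.

```python
import pandas as pd

# Hypothetical three-row stand-in for iris_data
demo = pd.DataFrame({
    "sepal_length": [5.1, 5.9, 6.7],
    "petal_width": [0.2, 1.8, 2.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Keep only numeric columns; strings cannot be compared with floats
numeric = demo.select_dtypes(include="number")

col_max = numeric.max(axis=0)          # axis=0: one maximum per column
row_max = numeric.max(axis=1)          # axis=1: one maximum per row
table_max = numeric.to_numpy().max()   # axis=None equivalent: one maximum overall

print(col_max)
print(row_max)
print(table_max)
```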