# Background
The Iris dataset contains four features (the length and width of the sepals and petals) for 50 samples from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total. These measurements were used to build a linear discriminant model that classifies the species. The dataset is often used in data mining, classification and clustering examples, and to test algorithms.

## Data Description

Petal Length (cm)

Petal Width (cm)

Sepal Length (cm)

Sepal Width (cm)

Species - Setosa, Versicolor, and Virginica

![sepal vs petal](https://yculz33w9skgdkhey8rajqm6-wpengine.netdna-ssl.com/wp-content/uploads/2018/11/versicolor.jpg)

### Getting Started with Pandas:

In [1]:
import pandas as pd 

print(pd.__version__)

1.3.1


### Load the dataset
The Iris flower data set, also known as Fisher’s Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It is one of the most widely used datasets among people learning machine learning.

The dataset is available in the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/iris). Here we will load it through the seaborn library instead.

In [9]:
conda update pandas -y


Note: you may need to restart the kernel to use updated packages.


In [10]:
import pandas as pd
import seaborn


data = seaborn.load_dataset("iris")
type(data)

pandas.core.frame.DataFrame

### Check if the dataset has been loaded correctly

In [11]:
data.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [19]:
# Looking at the first few rows, transposed so that each observation becomes a column
data.head().T

Unnamed: 0,0,1,2,3,4
sepal_length,5.1,4.9,4.7,4.6,5.0
sepal_width,3.5,3.0,3.2,3.1,3.6
petal_length,1.4,1.4,1.3,1.5,1.4
petal_width,0.2,0.2,0.2,0.2,0.2
species,setosa,setosa,setosa,setosa,setosa


In [20]:
# Looking at the first few rows again
data.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [24]:
# Looking at the last few rows of the data frame
data.tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


We can save this dataset to our local system as a CSV file.

### Export dataframe as CSV

In [69]:
data.to_csv('iris.csv', index=False) # Saves the file in the same folder that contains the notebook

In [26]:
df = pd.read_csv("iris.csv")
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [29]:
(df == data).sum()

sepal_length    150
sepal_width     150
petal_length    150
petal_width     150
species         150
dtype: int64
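
The element-wise comparison above counts matching cells per column. As a hedged alternative, the sketch below checks a CSV round trip with a single boolean using `DataFrame.equals`. It uses three rows copied from the `head()` output and an in-memory buffer instead of a file; the names `sample`, `restored` and `round_trip_ok` are made up for this illustration.

```python
import io

import pandas as pd

# A few rows copied from the head() output shown earlier.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "petal_length": [1.4, 1.4, 1.3],
        "petal_width": [0.2, 0.2, 0.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# float_precision="round_trip" asks the parser to reproduce floats exactly.
restored = pd.read_csv(buffer, float_precision="round_trip")

# equals() returns one bool: same shape, same values, same dtypes.
round_trip_ok = sample.equals(restored)
print(round_trip_ok)
```

If you forget `index=False` when writing, the row labels are saved as an extra column, and the comparison would then fail because the shapes differ.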

Let us now look at the data itself.

### Displaying a random sample of rows

In [32]:
data.sample?

Signature:
data.sample(
    n=None,
    frac: 'float | None' = None,
    replace: 'bool_t' = False,
    weights=None,
    random_state=None,
    axis: 'Axis | None' = None,
    ignore_index: 'bool_t' = False,
) -> 'FrameOrSeries'
Docstring:
Return a random sample of items from an axis of object.

You can use `random_state` for reproducibility.

Parameters
----------
n : int, optional
    Number of items from axis t
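
The help text lists the parameters but does not show a call. Here is a minimal sketch of `sample` using `n`, `frac` and `random_state` from the signature above. The frame holds made-up stand-in values, not Iris data, and the variable names are only for illustration.

```python
import pandas as pd

# Stand-in frame with made-up values; any DataFrame behaves the same way.
frame = pd.DataFrame({"row_id": range(20), "value": [x * 0.5 for x in range(20)]})

# n picks an absolute number of rows.
three_rows = frame.sample(n=3, random_state=42)

# frac picks a fraction of the rows instead (25% of 20 rows is 5 rows).
quarter = frame.sample(frac=0.25, random_state=42)

# The same random_state selects the same rows on every call.
same_again = frame.sample(n=3, random_state=42)
print(three_rows.equals(same_again))
```

Leaving `random_state` unset gives a different sample on each run, which is fine for a quick look but makes results harder to reproduce.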

### Check out the shape of the dataset

In [33]:
data.shape

(150, 5)

The dataset has 150 rows of observations and 5 columns.

### Slicing the rows
Suppose you want to print or work on a particular group of rows, say rows 10 through 20.

In [36]:
# data[start:end] slices rows by position
# start is inclusive whereas end is exclusive
display(data[10:21])  # used to be: print(data[10:21])
# this displays rows 10 to 20

# you can also save the slice in a variable for further analysis
sliced_data = data[10:21]
display(sliced_data)  # used to be: print(sliced_data)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
10,5.4,3.7,1.5,0.2,setosa
11,4.8,3.4,1.6,0.2,setosa
12,4.8,3.0,1.4,0.1,setosa
13,4.3,3.0,1.1,0.1,setosa
14,5.8,4.0,1.2,0.2,setosa
15,5.7,4.4,1.5,0.4,setosa
16,5.4,3.9,1.3,0.4,setosa
17,5.1,3.5,1.4,0.3,setosa
18,5.7,3.8,1.7,0.3,setosa
19,5.1,3.8,1.5,0.3,setosa


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
10,5.4,3.7,1.5,0.2,setosa
11,4.8,3.4,1.6,0.2,setosa
12,4.8,3.0,1.4,0.1,setosa
13,4.3,3.0,1.1,0.1,setosa
14,5.8,4.0,1.2,0.2,setosa
15,5.7,4.4,1.5,0.4,setosa
16,5.4,3.9,1.3,0.4,setosa
17,5.1,3.5,1.4,0.3,setosa
18,5.7,3.8,1.7,0.3,setosa
19,5.1,3.8,1.5,0.3,setosa


In [39]:
sliced_data[0:3]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
10,5.4,3.7,1.5,0.2,setosa
11,4.8,3.4,1.6,0.2,setosa
12,4.8,3.0,1.4,0.1,setosa
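
Bracket slicing is positional, which is why `sliced_data[0:3]` returns the rows labelled 10 to 12. The sketch below compares bracket slicing with `.iloc` (positional, end exclusive) and `.loc` (label-based, end inclusive). The frame contains made-up values with the default 0..29 index; it is an illustration, not the Iris data.

```python
import pandas as pd

# Made-up values with the default integer index 0..29.
frame = pd.DataFrame({"value": range(100, 130)})

by_brackets = frame[10:21]    # positional, end exclusive: rows 10..20
by_iloc = frame.iloc[10:21]   # explicit positional form of the same slice
by_loc = frame.loc[10:20]     # label-based, end inclusive: also rows 10..20

# A slice keeps the original row labels, so slicing it again by position
# returns the rows labelled 10, 11 and 12.
sliced = frame[10:21]
first_three = sliced[0:3]
print(list(first_three.index))  # [10, 11, 12]
```

The three forms agree here only because the index labels happen to equal the row positions. After filtering or sorting they can differ, so `.iloc` and `.loc` make the intent explicit.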


### Displaying only specific columns

In [42]:
data[["petal_width", "species"]]

Unnamed: 0,petal_width,species
0,0.2,setosa
1,0.2,setosa
2,0.2,setosa
3,0.2,setosa
4,0.2,setosa
...,...,...
145,2.3,virginica
146,1.9,virginica
147,2.0,virginica
148,2.3,virginica


In [44]:
# Select the columns petal_width and species from the iris data
# and save them in another variable named "specific_data"

specific_data = data[["petal_width", "species"]]
# general form: data[["column_name1", "column_name2", "column_name3"]]

# now display the first 10 rows of the specific_data dataframe
display(specific_data.head(10))

Unnamed: 0,petal_width,species
0,0.2,setosa
1,0.2,setosa
2,0.2,setosa
3,0.2,setosa
4,0.2,setosa
5,0.4,setosa
6,0.3,setosa
7,0.2,setosa
8,0.2,setosa
9,0.1,setosa
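
The double brackets matter: a list of column names returns a DataFrame, while a single name returns a Series. A small sketch using three rows copied from the output above (the variable names are illustrative):

```python
import pandas as pd

# Three rows copied from the output above.
sample = pd.DataFrame(
    {
        "petal_width": [0.2, 0.2, 0.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

one_column = sample["species"]            # single name -> Series
one_column_frame = sample[["species"]]    # list with one name -> DataFrame
two_columns = sample[["petal_width", "species"]]

print(type(one_column).__name__)        # Series
print(type(one_column_frame).__name__)  # DataFrame
```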


### Kaspar edit: print() vs display()
* print() works everywhere, but its output can be uglier.
* display() only works in IPython (Jupyter notebooks and a few other places), but its output looks nicer.
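
If the same code has to run both inside and outside a notebook, one possible approach is to fall back to `print` when IPython is not installed. This is only a sketch; the name `show` is made up for the example.

```python
import pandas as pd

# Use IPython's display when it is available, otherwise fall back to print.
try:
    from IPython.display import display as show
except ImportError:
    show = print

frame = pd.DataFrame(
    {"petal_width": [0.2, 0.2], "species": ["setosa", "setosa"]}
)
show(frame)
```

In a plain Python script, IPython's `display` typically falls back to the plain-text representation, so the output resembles what `print` would give.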

### Calculating sum, mean, median and mode of a particular column

In [50]:
data.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [54]:
data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].sum()

sepal_length    876.5
sepal_width     458.6
petal_length    563.7
petal_width     179.9
dtype: float64

In [55]:
# data["column_name"].sum() 
  
sum_data = data['sepal_length'].sum() 
mean_data = data['sepal_length'].mean() 
median_data = data['sepal_length'].median() 
mode_data = data['sepal_length'].mode() 
  
print("Sum:", sum_data, "\nMean:", mean_data, "\nMedian:", median_data, "\nMode:", mode_data) 

Sum: 876.5 
Mean: 5.843333333333334 
Median: 5.8 
Mode: 0    5.0
dtype: float64


In [56]:
## Kaspar Edit:  what's going on w/ "df.mode()"?  Does it have two answers?
len(mode_data)

1

In [58]:
type(mode_data)

pandas.core.series.Series

In [60]:
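has_one = pd.DataFrame({"nums": [5,5,5,5,5,5,1,2,3,4,6]})
has_one.mode()  # not displayed: a notebook cell only shows its last expression

has_multiple = pd.DataFrame({"nums": [5,5,5,3,3,3,1,1,1]})
has_multiple.mode()  # three values are tied, so mode() returns all of them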

Unnamed: 0,nums
0,1
1,3
2,5
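
So `mode()` always returns a collection because several values can tie for most frequent. The sketch below repeats the experiment with Series and shows one way to get either all modes or just the first one. Taking the first mode is an illustrative choice, not a rule.

```python
import pandas as pd

has_one = pd.Series([5, 5, 5, 5, 5, 5, 1, 2, 3, 4, 6])
has_multiple = pd.Series([5, 5, 5, 3, 3, 3, 1, 1, 1])

single_mode = has_one.mode()      # Series holding one value
tied_modes = has_multiple.mode()  # Series holding every tied value, sorted

print(single_mode.tolist())  # [5]
print(tied_modes.tolist())   # [1, 3, 5]

# If a single number is needed, pick one explicitly, e.g. the smallest mode.
first_mode = tied_modes.iloc[0]
```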


### Calculating sum, mean, median and mode for a particular species

In [62]:
data['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [63]:
data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [64]:
data.loc[data['species'] == 'setosa', 'sepal_length']

0     5.1
1     4.9
2     4.7
3     4.6
4     5.0
5     5.4
6     4.6
7     5.0
8     4.4
9     4.9
10    5.4
11    4.8
12    4.8
13    4.3
14    5.8
15    5.7
16    5.4
17    5.1
18    5.7
19    5.1
20    5.4
21    5.1
22    4.6
23    5.1
24    4.8
25    5.0
26    5.0
27    5.2
28    5.2
29    4.7
30    4.8
31    5.4
32    5.2
33    5.5
34    4.9
35    5.0
36    5.5
37    4.9
38    4.4
39    5.1
40    5.0
41    4.5
42    4.4
43    5.0
44    5.1
45    4.8
46    5.1
47    4.6
48    5.3
49    5.0
Name: sepal_length, dtype: float64

In [79]:
# Select the sepal_length values of the rows where species == 'setosa'
setosa_sepal_length = data.loc[data['species'] == 'setosa', 'sepal_length']

sum_data_setosa = setosa_sepal_length.sum()
mean_data_setosa = setosa_sepal_length.mean()
median_data_setosa = setosa_sepal_length.median()
mode_data_setosa = setosa_sepal_length.mode()

print("Sum:", sum_data_setosa,
      "\nMean:", mean_data_setosa,
      "\nMedian:", median_data_setosa,
      "\nMode:", mode_data_setosa
     )

Sum: 250.3 
Mean: 5.006 
Median: 5.0 
Mode: 0    5.0
1    5.1
dtype: float64
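
Several statistics can also be computed in one call with `agg`. The sketch below uses a handful of `sepal_length` values copied from the outputs above (three setosa rows and two virginica rows), so its numbers are for this small sample, not the full dataset.

```python
import pandas as pd

# A few sepal_length values copied from the outputs above.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 6.7, 6.3],
        "species": ["setosa", "setosa", "setosa", "virginica", "virginica"],
    }
)

# Filter once, then reuse the resulting Series.
setosa_lengths = sample.loc[sample["species"] == "setosa", "sepal_length"]

# agg with a list of function names returns a Series indexed by those names.
summary = setosa_lengths.agg(["sum", "mean", "median"])
print(summary)
```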


In [73]:
print("abc\n123")

abc
123


In [66]:
data.loc[data['species'] == 'setosa', 'sepal_length'].std() 

0.35248968721345136
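
Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default. The sketch below, using five `sepal_length` values copied from the first rows of the dataset, contrasts it with the population version (`ddof=0`) and checks the default against the textbook formula.

```python
import math

import pandas as pd

# The first five sepal_length values of the dataset.
values = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

sample_std = values.std()            # divides by n - 1 (ddof=1, the default)
population_std = values.std(ddof=0)  # divides by n

# The same sample standard deviation computed by hand.
mean = values.mean()
manual_std = math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))

print(sample_std, population_std)
```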

In [82]:
data.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [86]:
data.loc[27:30, 2:4]

TypeError: cannot do slice indexing on Index with these indexers [2] of type int
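
The error arises because `.loc` selects by label, and the column labels are strings rather than integers. A sketch of two working alternatives follows: `.iloc` for integer positions, or `.loc` with column names. The frame has the Iris column names but placeholder values, since only the indexing behaviour matters here.

```python
import pandas as pd

numeric_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Placeholder values: row i holds the number i in every numeric column.
frame = pd.DataFrame(
    {name: [float(i) for i in range(40)] for name in numeric_columns}
)
frame["species"] = "placeholder"

# Positional: rows 27..30 and the columns at positions 2 and 3 (ends exclusive).
by_position = frame.iloc[27:31, 2:4]

# Label-based: row labels 27..30 and a named column range (ends inclusive).
by_label = frame.loc[27:30, "petal_length":"petal_width"]

print(by_position.equals(by_label))
```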

The groupby function is very helpful when we want to compute per-species statistics like the ones above.
Please try it on this dataset to practice.
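
As a starting point for that practice, here is a minimal sketch of `groupby` on six rows copied from the outputs above (three setosa, three virginica). The results describe this small sample only; running the same call on the full `data` frame gives the per-species figures for all 150 rows.

```python
import pandas as pd

# Six sepal_length values copied from the head() and tail() outputs above.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 6.7, 6.3, 6.5],
        "species": ["setosa"] * 3 + ["virginica"] * 3,
    }
)

# One row per species, one column per statistic.
per_species = sample.groupby("species")["sepal_length"].agg(["sum", "mean", "median"])
print(per_species)
```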

We will discuss groupby and several other data manipulation functions in the next session.