# Background
The Iris dataset contains four features (the length and width of the sepals and petals) for 50 samples from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total. These measurements were used to create a linear discriminant model to classify the species. The dataset is often used in data mining, classification and clustering examples, and to test algorithms.

## Data Description

Petal Length - in cm

Petal Width - in cm

Sepal Length - in cm

Sepal Width - in cm

Species - setosa, versicolor, and virginica

### Getting Started with Pandas:

In [1]:
import pandas as pd 

### Load the dataset
The Iris flower data set, also known as Fisher’s Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It is one of the best-known datasets among people learning machine learning.

The dataset is available in the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/iris). Here we will load it through a library called seaborn.

In [1]:
import pandas as pd
import seaborn
data = seaborn.load_dataset("iris")



### Check if the dataset has been loaded correctly

In [10]:
data.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [2]:
# Looking at the first few rows
data.head()

help(data.head)

Help on method head in module pandas.core.generic:

head(n: int = 5) -> ~FrameOrSeries method of pandas.core.frame.DataFrame instance
    Return the first `n` rows.
    
    This function returns the first `n` rows for the object based
    on position. It is useful for quickly testing if your object
    has the right type of data in it.
    
    For negative values of `n`, this function returns all rows except
    the last `n` rows, equivalent to ``df[:-n]``.
    
    Parameters
    ----------
    n : int, default 5
        Number of rows to select.
    
    Returns
    -------
    same type as caller
        The first `n` rows of the caller object.
    
    See Also
    --------
    DataFrame.tail: Returns the last `n` rows.
    
    Examples
    --------
    >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
    ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >>> df
          animal
    0  alligator
    1        bee
    2     falcon
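
The `n` parameter described in the help text can be tried on a small frame. The following is a minimal sketch: the frame `demo` and its variable names are illustrative and not part of the notebook, and its values are copied from the first five rows shown earlier.

```python
import pandas as pd

# Small hand-built frame (illustrative only), using the first five
# sepal_length values displayed by data.head() above.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "species": ["setosa"] * 5,
})

first_three = demo.head(3)        # rows at positions 0, 1 and 2
all_but_last_two = demo.head(-2)  # negative n: everything except the last 2 rows

print(first_three)
print(all_but_last_two)
```

On the full dataset the same calls would be `data.head(3)` and `data.head(-2)`.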
  

In [12]:
# Looking at the first few rows again
data.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [13]:
# Looking at the last few rows of the data frame
data.tail()

     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica


We can save this dataset to our local system as a csv file.

### Export dataframe as csv

In [26]:
# index=False leaves the row index out of the file.
data.to_csv('iris.csv', index=False) # Writes to the current working directory, normally the folder that contains the notebook
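
To see what `index=False` changes, the export can be read back with `pd.read_csv`. This is a small sketch under stated assumptions: it uses an in-memory `io.StringIO` buffer instead of a file on disk, and the frame `demo` is an illustrative stand-in for `data`.

```python
import io
import pandas as pd

# Illustrative stand-in for the iris frame (three rows copied from above).
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "petal_width": [0.2, 0.2, 0.2],
    "species": ["setosa", "setosa", "setosa"],
})

# Export without the index, then read the CSV text back.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
clean_columns = pd.read_csv(without_index).columns.tolist()

# Export with the default index=True: the index is written as an
# unnamed first column, which read_csv labels "Unnamed: 0".
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
extra_columns = pd.read_csv(with_index).columns.tolist()

print(clean_columns)
print(extra_columns)
```

With a real file, `pd.read_csv('iris.csv')` plays the same role as reading from the buffer.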

Let us now look at the data itself.

### Displaying a random sample of rows

In [14]:
data.sample(10) 

     sepal_length  sepal_width  petal_length  petal_width     species
140           6.7          3.1           5.6          2.4   virginica
97            6.2          2.9           4.3          1.3  versicolor
37            4.9          3.6           1.4          0.1      setosa
116           6.5          3.0           5.5          1.8   virginica
148           6.2          3.4           5.4          2.3   virginica
143           6.8          3.2           5.9          2.3   virginica
66            5.6          3.0           4.5          1.5  versicolor
119           6.0          2.2           5.0          1.5   virginica
59            5.2          2.7           3.9          1.4  versicolor
41            4.5          2.3           1.3          0.3      setosa
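
Because `sample` draws rows at random, each run returns different rows. If a repeatable sample is wanted, a seed can be passed through the `random_state` parameter. The sketch below uses an illustrative frame `demo` with a made-up `row_id` column; the seed values are arbitrary.

```python
import pandas as pd

# Illustrative frame with 20 rows; `row_id` is a made-up column name.
demo = pd.DataFrame({"row_id": range(20)})

# The same seed gives the same rows on every call.
first_draw = demo.sample(5, random_state=42)
second_draw = demo.sample(5, random_state=42)
print(first_draw.equals(second_draw))

# frac= draws a fraction of the rows instead of a fixed count.
quarter = demo.sample(frac=0.25, random_state=0)
print(len(quarter))
```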


### Check out the shape of the dataset

In [15]:
data.shape

(150, 5)

The dataset has 150 rows of observations and 5 columns.
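
Since `shape` is a plain tuple of `(rows, columns)`, it can be unpacked into two variables. A minimal sketch on an illustrative frame `demo` (not part of the notebook):

```python
import pandas as pd

# Illustrative frame with 3 rows and 2 columns.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "species": ["setosa", "setosa", "setosa"],
})

n_rows, n_cols = demo.shape  # unpack the (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the number of rows only.
print(len(demo) == n_rows)
```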

### Slicing the rows
To print or work with a particular range of rows, say rows 10 through 20, slice the DataFrame.

In [16]:
# data[start:end] selects rows by position
# start is inclusive whereas end is exclusive
print(data[10:21])
# This prints the rows at positions 10 through 20.

# You can also save the slice in a variable for further analysis
sliced_data = data[10:21]
print(sliced_data)

    sepal_length  sepal_width  petal_length  petal_width species
10           5.4          3.7           1.5          0.2  setosa
11           4.8          3.4           1.6          0.2  setosa
12           4.8          3.0           1.4          0.1  setosa
13           4.3          3.0           1.1          0.1  setosa
14           5.8          4.0           1.2          0.2  setosa
15           5.7          4.4           1.5          0.4  setosa
16           5.4          3.9           1.3          0.4  setosa
17           5.1          3.5           1.4          0.3  setosa
18           5.7          3.8           1.7          0.3  setosa
19           5.1          3.8           1.5          0.3  setosa
20           5.4          3.4           1.7          0.2  setosa
    sepal_length  sepal_width  petal_length  petal_width species
10           5.4          3.7           1.5          0.2  setosa
11           4.8          3.4           1.6          0.2  setosa
12           4.8         
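
The same range can be selected more explicitly with `iloc` (by position, end exclusive) or `loc` (by label, end inclusive). The sketch below assumes a default 0-based integer index, as in this dataset; the frame `demo` and its `value` column are illustrative.

```python
import pandas as pd

# Illustrative frame with a default integer index 0..29.
demo = pd.DataFrame({"value": range(100, 130)})

by_brackets = demo[10:21]       # positional slice, end exclusive
by_position = demo.iloc[10:21]  # explicit positional slice, end exclusive
by_label = demo.loc[10:20]      # label-based slice, end INCLUSIVE

# With a default index all three select the rows labelled 10 to 20.
print(by_brackets.equals(by_position))
print(by_position.equals(by_label))
```

The distinction matters once the index is no longer 0, 1, 2, ..., for example after filtering or sampling rows.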

### Displaying only specific columns

In [18]:
# Select the columns petal_width and species from the iris data
# and save them in another variable named "specific_data"

specific_data = data[["petal_width", "species"]]
# General form: data[["column_name1", "column_name2", "column_name3"]]

# Now print the first 10 rows of the specific_data dataframe.
print(specific_data.head(10))

   petal_width species
0          0.2  setosa
1          0.2  setosa
2          0.2  setosa
3          0.2  setosa
4          0.2  setosa
5          0.4  setosa
6          0.3  setosa
7          0.2  setosa
8          0.2  setosa
9          0.1  setosa
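
Note the double brackets in the selection above. A list of column names returns a DataFrame, while a single name in single brackets returns a Series. A short sketch on an illustrative frame `demo`:

```python
import pandas as pd

# Illustrative frame; values copied from rows shown above.
demo = pd.DataFrame({
    "petal_width": [0.2, 0.4, 0.3],
    "species": ["setosa", "setosa", "setosa"],
})

as_series = demo["petal_width"]   # single brackets: a Series
as_frame = demo[["petal_width"]]  # double brackets: a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```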


### Calculating sum, mean, median and mode of a particular column

In [19]:
data.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [20]:
# data["column_name"].sum() 
  
sum_data = data['sepal_length'].sum() 
mean_data = data['sepal_length'].mean() 
median_data = data['sepal_length'].median() 
mode_data = data['sepal_length'].mode() 
  
print("Sum:",sum_data, "\nMean:", mean_data, "\nMedian:",median_data, "\nMode:",mode_data) 

Sum: 876.5 
Mean: 5.843333333333335 
Median: 5.8 
Mode: 0    5.0
dtype: float64
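
The mode is printed as a Series because a column can have more than one most-frequent value, and `mode()` returns all of them. The sketch below illustrates this on a made-up Series `values` with a tie, and also shows `agg`, which computes several statistics in one call.

```python
import pandas as pd

# Made-up values with a tie: 5.0 and 5.1 both appear twice.
values = pd.Series([4.9, 5.0, 5.0, 5.1, 5.1], name="sepal_length")

# agg with a list of function names returns one result per name.
summary = values.agg(["sum", "mean", "median"])
print(summary)

# mode() returns every tied value, sorted in ascending order.
modes = values.mode()
print(modes.tolist())
```

On the full dataset, `data['sepal_length'].agg(["sum", "mean", "median"])` would give the first three numbers printed above in a single call.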


### Calculating sum, mean, median and mode of a particular species

In [21]:
data.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [22]:
# Keep only the rows where species == 'setosa', then summarise sepal_length

setosa_sepal_length = data.loc[data['species'] == 'setosa', 'sepal_length']
sum_data_setosa = setosa_sepal_length.sum()
mean_data_setosa = setosa_sepal_length.mean()
median_data_setosa = setosa_sepal_length.median()
mode_data_setosa = setosa_sepal_length.mode()

print("Sum:",sum_data_setosa, "\nMean:", mean_data_setosa, "\nMedian:",median_data_setosa, "\nMode:",mode_data_setosa)

Sum: 250.3 
Mean: 5.005999999999999 
Median: 5.0 
Mode: 0    5.0
1    5.1
dtype: float64


The groupby function is very helpful when we want to analyse this kind of per-species information in the data.
Please try it on this dataset to practice.
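
As a starting point for that practice, here is one possible sketch of a per-species summary. It assumes an illustrative six-row frame `demo` built from rows shown earlier, not the full dataset.

```python
import pandas as pd

# Illustrative frame: two rows per species, copied from outputs above.
demo = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 6.2, 5.6, 6.7, 6.3],
})

# Split the rows by species, then compute each statistic per group.
per_species = demo.groupby("species")["sepal_length"].agg(["sum", "mean", "median"])
print(per_species)
```

Replacing `demo` with `data` would produce the same kind of table for all 150 rows, one line per species.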

We will discuss groupby and several other data manipulation functions in the next session.