<a href="https://colab.research.google.com/github/alfonsoayalapaloma/ml-2024/blob/main/ds_eda_01_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">


## <center>Top Pandas Functions</center>

## IRIS Dataset

The Iris dataset is a classic dataset in the field of machine learning and statistics. It was introduced by the British biologist and statistician Ronald A. Fisher in 1936. Here's a brief overview:

### Overview
The Iris dataset contains measurements of four features (attributes) for 150 samples of iris flowers from three different species. The goal is often to classify the species based on these features.

### Features
1. **Sepal Length**: The length of the sepal (in centimeters).
2. **Sepal Width**: The width of the sepal (in centimeters).
3. **Petal Length**: The length of the petal (in centimeters).
4. **Petal Width**: The width of the petal (in centimeters).

### Target
The dataset includes three species of iris flowers:
1. **Iris setosa**
2. **Iris versicolor**
3. **Iris virginica**

Each species has 50 samples in the dataset.

### Structure
The dataset is structured as follows:
- **Rows**: Each row represents a single sample (flower).
- **Columns**: Each column represents a feature or the target species.

### Example
Here's a small sample of what the dataset looks like:

| Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species         |
|-------------------|------------------|-------------------|------------------|-----------------|
| 5.1               | 3.5              | 1.4               | 0.2              | Iris setosa     |
| 7.0               | 3.2              | 4.7               | 1.4              | Iris versicolor |
| 6.3               | 3.3              | 6.0               | 2.5              | Iris virginica  |

### Applications
The Iris dataset is commonly used for:
- **Classification**: Building models to classify the species of iris flowers based on the features.
- **Visualization**: Exploring data visualization techniques, such as scatter plots and histograms.
- **Machine Learning**: Testing and comparing different machine learning algorithms.

It is a good dataset for beginners practicing data analysis and machine learning techniques because it is small, balanced across the three species, and has no missing values.

In [None]:
import pandas as pd
import seaborn as sns

#### Load the data (`sns.load_dataset()` here; `pd.read_csv()` for CSV files)

In [None]:
iris = sns.load_dataset('iris')
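
The cell above takes Iris from seaborn's example datasets, which are fetched online and cached on first use, rather than from a local CSV file. When the data does live in a CSV, `pd.read_csv()` is the usual entry point. The sketch below is only an illustration: the `csv_text` string is a hypothetical stand-in for a file on disk, filled with the first three rows shown later in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as 'iris.csv'.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
"""

# read_csv accepts a file path, a URL, or any file-like object.
iris_from_csv = pd.read_csv(io.StringIO(csv_text))

print(iris_from_csv.shape)  # (3, 5)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file's path.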

#### Explore your data

In [None]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [None]:
iris.shape

(150, 5)
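
The overview states that the 150 rows split evenly into 50 samples per species. One way to check a claim like this is `value_counts()`. The sketch below builds a stand-in `species` Series by hand (an assumption, so that the example runs without downloading anything); on the real data the same call would be `iris['species'].value_counts()`.

```python
import pandas as pd

# Stand-in for iris['species']: 50 labels per species, as described above.
species = pd.Series(
    ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50,
    name="species",
)

# value_counts() tallies how often each distinct value occurs.
counts = species.value_counts()
print(counts)

# nunique() returns the number of distinct classes.
print(species.nunique())  # 3
```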

In [None]:
iris.head(3)

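|     | sepal_length | sepal_width | petal_length | petal_width | species |
|-----|--------------|-------------|--------------|-------------|---------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | setosa  |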


In [None]:
iris.tail(3)

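|     | sepal_length | sepal_width | petal_length | petal_width | species   |
|-----|--------------|-------------|--------------|-------------|-----------|
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | virginica |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | virginica |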


#### Datatypes

In [None]:
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
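
Knowing the dtypes makes it easy to separate the numeric measurements from the text label. The sketch below uses `select_dtypes()` on a tiny hand-built frame; the `sample` DataFrame is an assumption made for illustration and holds only two of the measurement columns.

```python
import pandas as pd

# Small stand-in frame: one row per species, values from the example table.
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3],
        "petal_length": [1.4, 4.7, 6.0],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

# Keep only the numeric columns (here, the float64 measurements).
numeric = sample.select_dtypes(include="number")
print(list(numeric.columns))  # ['sepal_length', 'petal_length']

# Everything that is not numeric (here, the species label).
non_numeric = sample.select_dtypes(exclude="number")
print(list(non_numeric.columns))  # ['species']
```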

#### Subsetting your data with `loc` & `iloc`

In [None]:
iris.loc[3:5]

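|     | sepal_length | sepal_width | petal_length | petal_width | species |
|-----|--------------|-------------|--------------|-------------|---------|
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | setosa  |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | setosa  |
| 5   | 5.4          | 3.9         | 1.7          | 0.4         | setosa  |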


In [None]:
iris.loc[3,'sepal_length']

4.6

In [None]:
iris.iloc[3,0]

4.6
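
The cells above return the same value because the default index labels (0, 1, 2, ...) happen to match the row positions. The difference shows up in slices and in frames whose index is not in the default order: `loc` selects by label and includes the end of a slice, while `iloc` selects by integer position and excludes it. A minimal sketch, assuming a hand-built `sample` frame with the first six `sepal_length` values shown in this notebook:

```python
import pandas as pd

# Stand-in for the first six rows of iris (one measurement column only).
sample = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
        "species": ["setosa"] * 6,
    }
)

# Label-based slice: both endpoints are included, so rows 3, 4 and 5.
by_label = sample.loc[3:5]
print(len(by_label))  # 3

# Position-based slice: the end is excluded, so rows 3 and 4 only.
by_position = sample.iloc[3:5]
print(len(by_position))  # 2

# Reverse the rows so that labels and positions no longer agree.
reversed_sample = sample.iloc[::-1]
print(reversed_sample.loc[0, "sepal_length"])  # 5.1 (row labelled 0)
print(reversed_sample.iloc[0, 0])              # 5.4 (first row by position)
```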

#### Export your data as csv using `to_csv`

In [None]:
iris.to_csv('iris-output.csv',index=False)
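
The `index=False` argument matters when the file is read back. By default `to_csv` also writes the row index as an unnamed first column, which `pd.read_csv()` then loads as a column called `Unnamed: 0`. The sketch below shows both cases on a small hand-built frame (the `sample` DataFrame is an assumption for illustration) and keeps the CSV text in memory instead of writing a file.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa"] * 3}
)

# With no path argument, to_csv() returns the CSV text as a string.
with_index_csv = sample.to_csv()                # default: index=True
without_index_csv = sample.to_csv(index=False)

# Read both versions back in.
with_index = pd.read_csv(io.StringIO(with_index_csv))
without_index = pd.read_csv(io.StringIO(without_index_csv))

print(list(with_index.columns))     # ['Unnamed: 0', 'sepal_length', 'species']
print(list(without_index.columns))  # ['sepal_length', 'species']
```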