# **Data Preprocessing**

# **Key Features of Pandas**
**DataFrame:** A two-dimensional labeled data structure for working with tabular data.

**Data Manipulation:** Powerful tools for filtering, transforming, and aggregating data.

**Data Cleaning:** Functions to handle missing values, duplicate data, and outliers.

**Data Input/Output:** Read and write data from various file formats, including CSV and Excel.

## **Importing the Required Libraries**

In [2]:
import pandas as pd

## **1. Load the Dataset**
First, we need to load the dataset into a pandas DataFrame. We’ll use the Iris dataset as an example.

In [3]:
df = pd.read_csv("iris.csv")
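
Since `iris.csv` itself is not reproduced here, the following is only a minimal sketch of how `read_csv` behaves, using a hypothetical three-row sample held in memory. It assumes a file whose first column is an unnamed row index, which is a common export layout and not necessarily that of the file used in this notebook. The names `csv_text`, `raw`, and `sample` are made up for the illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a CSV file on disk
csv_text = """,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
"""

# Without index_col, the unnamed first column is loaded as a data column
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(sample.columns.tolist())
```

If the file has no index column at all, a plain `pd.read_csv(path)` is enough and pandas creates a default 0-based index.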

## **2. Basic Data Exploration**
Now, let’s apply various methods to explore the dataset.

## **2.1. Displaying the Top Rows of the Data**

The **head()** method returns the first few rows of a DataFrame or Series. It returns 5 rows by default, or `n` rows when called as `head(n)`.

It is primarily used for data exploration and getting a quick overview of the data.

In [4]:
# Display the first 3 rows
df.head(3)


|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |


## **2.2. Displaying the Last Rows of the Data**

The **tail()** method returns the last few rows of a DataFrame or Series (5 rows by default).

It is a convenient way to inspect the end of a dataset without displaying the entire contents.

In [5]:
# Display the last row
df.tail(1)

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
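
To make the role of the row-count argument concrete, here is a small sketch on a hypothetical eight-row frame (`demo` and its `value` column are invented for the illustration). It shows the default of 5 rows, an explicit count, and the negative form, which keeps everything except the last `n` rows.

```python
import pandas as pd

# Hypothetical 8-row frame used only for illustration
demo = pd.DataFrame({"value": range(8)})

first_default = demo.head()        # n defaults to 5
first_two = demo.head(2)           # explicit row count
last_three = demo.tail(3)          # last 3 rows, original index labels kept
all_but_last_two = demo.head(-2)   # negative n: all rows except the last 2

print(len(first_default), len(first_two), len(all_but_last_two))
print(last_three.index.tolist())
```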


## **2.3. Shape of the Data**

In pandas, the **.shape** attribute gives the dimensions of a DataFrame or Series. For a DataFrame it returns a tuple `(rows, columns)`; for a Series it returns a one-element tuple `(rows,)`.

In [5]:
df.shape

(150, 5)
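
As a quick illustration of how the tuple can be used, the sketch below unpacks `.shape` on a hypothetical three-row frame (`demo` is invented for the example) and compares it with the shape of a single column.

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 columns
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = demo.shape       # DataFrame: (rows, columns)
series_shape = demo["a"].shape    # Series: one-element tuple (rows,)

print(n_rows, n_cols)
print(series_shape)
print(len(demo), demo.size)       # row count and total number of cells
```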

## **2.4. Data Types in the Dataset**

In pandas, the **.dtypes** attribute of a DataFrame returns a Series listing the data type of each column. A single Series has the singular **.dtype** attribute, which returns the data type of its elements.

In [6]:
df.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
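
The difference between `.dtypes` and `.dtype`, and two common follow-up steps, can be sketched as follows. The two-row frame `demo` is hypothetical, and converting `species` to a categorical column is shown as one possible choice, not something this notebook does.

```python
import pandas as pd

# Hypothetical two-row frame with one numeric and one text column
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "species": ["setosa", "setosa"],
})

col_types = demo.dtypes                  # Series: one dtype per column
one_type = demo["sepal_length"].dtype    # single dtype of one column

# select_dtypes picks columns by type, here only the numeric ones
numeric_cols = demo.select_dtypes(include="number").columns.tolist()

# astype converts a column; "category" suits a small set of repeated labels
demo["species"] = demo["species"].astype("category")

print(col_types)
print(one_type, numeric_cols, demo["species"].dtype)
```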

## **2.5. Description of the dataset**


The **.describe()** method in pandas provides summary statistics of a DataFrame or Series. For the numerical columns it computes the count, mean, standard deviation, minimum, quartiles, and maximum.

Here is a breakdown of the statistics provided by the `.describe()` function:

1. **Count**: The number of non-null values in each column.
2. **Mean**: The average value of each column.
3. **Standard Deviation (std)**: The sample standard deviation of each column, a measure of how widely its values are spread around the mean.
4. **Minimum (min)**: The smallest value in each column.
5. **25th Percentile (25%)**: Also known as the first quartile (Q1), this represents the value below which 25% of the data falls.
6. **50th Percentile (50%)**: Also known as the median, this represents the value below which 50% of the data falls.
7. **75th Percentile (75%)**: Also known as the third quartile (Q3), this represents the value below which 75% of the data falls.
8. **Maximum (max)**: The largest value in each column.

In [7]:
df.describe()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
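
To show where each row of the summary comes from, the sketch below recomputes the statistics one by one on a hypothetical five-value Series (`values` is invented for the example) and places them next to the output of `describe()`. Note that `std()` uses the sample formula (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical five-value Series used only for illustration
values = pd.Series([4.3, 5.1, 5.8, 6.4, 7.9], name="sepal_length")

summary = values.describe()

# The same statistics computed with the dedicated methods
manual = {
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),              # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
}

# Side-by-side comparison of describe() and the manual values
print(pd.DataFrame({"describe": summary, "manual": pd.Series(manual)}))
```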


In [8]:
# Summarize the DataFrame (info() prints its report directly and returns None)
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


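## **2.5. Selecting, Sorting, and Filtering Rows**

`iloc` selects rows by integer position, `sort_values` orders rows by a column, and a boolean condition inside `df[...]` keeps only the rows that match it.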

In [9]:
df.iloc[29]

sepal_length       4.7
sepal_width        3.2
petal_length       1.6
petal_width        0.2
species         setosa
Name: 29, dtype: object
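
Position-based and label-based selection only differ visibly when the index labels are not `0, 1, 2, ...`. The sketch below uses a hypothetical frame with labels 10 to 12 (the frame and its labels are invented for the example) to contrast `iloc` with `loc`.

```python
import pandas as pd

# Hypothetical frame whose index labels deliberately differ from positions
demo = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa"] * 3},
    index=[10, 11, 12],
)

by_position = demo.iloc[0]                 # first row, whatever its label
by_label = demo.loc[10]                    # row whose index label is 10
cell = demo.iloc[2, 0]                     # row position 2, column position 0
subset = demo.loc[11:12, "sepal_length"]   # label slices include both ends

print(by_position["sepal_length"], by_label["sepal_length"], cell)
print(subset.tolist())
```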

In [10]:
df.sort_values(by="sepal_length").head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
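
A few common variations on `sort_values` can be sketched on a hypothetical four-row frame (`demo` is invented for the example): descending order, sorting by more than one column, and renumbering the index after sorting. In every case the method returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical four-row frame used only for illustration
demo = pd.DataFrame({
    "species": ["virginica", "setosa", "setosa", "versicolor"],
    "sepal_length": [5.9, 4.4, 4.3, 7.0],
})

# Largest values first
longest_first = demo.sort_values(by="sepal_length", ascending=False)

# Sort by species, then by sepal_length within each species
by_species_then_length = demo.sort_values(by=["species", "sepal_length"])

# Sorting keeps the original index labels; reset_index renumbers them
renumbered = demo.sort_values(by="sepal_length").reset_index(drop=True)

print(longest_first.index.tolist())
print(by_species_then_length.index.tolist())
print(renumbered)
```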


In [11]:
df[df["species"] == "versicolor"].head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
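
The filter above works because the comparison produces a boolean Series with one value per row. The following sketch, on a hypothetical four-row frame (`demo` and the 6.0 threshold are invented for the example), shows that mask explicitly and two ways to build more specific filters.

```python
import pandas as pd

# Hypothetical four-row frame used only for illustration
demo = pd.DataFrame({
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
    "sepal_length": [5.1, 7.0, 5.5, 6.3],
})

mask = demo["species"] == "versicolor"    # boolean Series, one value per row
versicolor = demo[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses
long_versicolor = demo[(demo["species"] == "versicolor") & (demo["sepal_length"] > 6.0)]

# isin tests membership in a list of values
not_setosa = demo[demo["species"].isin(["versicolor", "virginica"])]

print(mask.tolist())
print(long_versicolor)
print(len(not_setosa))
```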


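## **2.6. Interquartile Range**


The **.quantile()** method in pandas calculates the quantile values of a Series or DataFrame. A quantile is the value below which a given fraction of the data falls, so quantiles describe the distribution of numerical data by dividing it into intervals. The interquartile range (IQR) is the difference between the third quartile (Q3, the 0.75 quantile) and the first quartile (Q1, the 0.25 quantile); it measures the spread of the middle 50% of the data.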

In [12]:
Q1 = df['sepal_length'].quantile(0.25)

In [13]:
Q3 = df['sepal_length'].quantile(0.75)

In [14]:
IQR = Q3-Q1

In [15]:
print("The interquartile range is: ", IQR)

The interquartile range is:  1.3000000000000007
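
One common use of the IQR is flagging outliers. The sketch below applies the widely used 1.5 × IQR rule (Tukey's fences) to a hypothetical Series; the rule, the sample values, and the made-up extreme value 15.0 are assumptions for illustration and are not part of the analysis above.

```python
import pandas as pd

# Hypothetical values; 15.0 is a made-up extreme added to trigger the rule
values = pd.Series([4.3, 5.1, 5.8, 6.4, 7.9, 15.0], name="sepal_length")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]

print("IQR:", round(iqr, 2))   # rounding hides floating-point noise
print(outliers)
```

Rounding before printing also avoids trailing digits such as `1.3000000000000007`, which come from floating-point arithmetic and not from the data.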


## **2.7. Handling Missing Data**

In [16]:
# Unique values in the 'species' column
df['species'].unique()


array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [22]:
# Flag missing values cell by cell (True marks a missing value)
df.isnull()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | False | False | False | False | False |
| 1 | False | False | False | False | False |
| 2 | False | False | False | False | False |
| 3 | False | False | False | False | False |
| 4 | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... |
| 145 | False | False | False | False | False |
| 146 | False | False | False | False | False |
| 147 | False | False | False | False | False |
| 148 | False | False | False | False | False |


In [23]:
# Count missing values per column
print(df.isnull().sum())


sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
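
Because the Iris data has no gaps, every count above is zero. To show what these checks report when values are actually missing, here is a sketch on a hypothetical three-row frame with two gaps (`demo` and its missing entries are invented for the example).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing number and one missing label
demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7],
    "species": ["setosa", "setosa", None],
})

per_column = demo.isnull().sum()                    # missing values per column
total_missing = int(per_column.sum())               # missing values overall
rows_with_gaps = demo[demo.isnull().any(axis=1)]    # rows with at least one gap

print(per_column)
print(total_missing)
print(rows_with_gaps)
```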


In [24]:
# Remove all rows that contain a missing value (returns a new DataFrame; df itself is unchanged)
df.dropna()

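|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |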
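
`dropna()` returns a new DataFrame and leaves the original untouched, which is easy to miss on a dataset without gaps. The sketch below demonstrates this on a hypothetical three-row frame (`demo` is invented for the example) and also shows the `subset` parameter, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing number and one missing label
demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7],
    "species": ["setosa", "setosa", None],
})

complete_rows = demo.dropna()                        # drop rows with any gap
has_length = demo.dropna(subset=["sepal_length"])    # only check one column
rows_before = len(demo)                              # original is still intact

# Assign the result back to keep the cleaned data
demo = demo.dropna()

print(complete_rows.index.tolist(), has_length.index.tolist())
print(rows_before, len(demo))
```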


In [17]:
# Value counts for the 'species' column
print(df['species'].value_counts())


species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
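
Class balance is often easier to read as proportions than as raw counts. As a small sketch on a hypothetical four-item Series (`species` here is invented and not the notebook's column), `normalize=True` turns the counts into fractions that sum to 1.

```python
import pandas as pd

# Hypothetical labels used only for illustration
species = pd.Series(["setosa", "setosa", "versicolor", "virginica"], name="species")

counts = species.value_counts()                  # most frequent value first
shares = species.value_counts(normalize=True)    # fractions instead of counts

print(counts)
print(shares)
```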


In [26]:
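# Fill missing numeric values with each column's mean (not needed here: no values are missing)
# df.fillna(df.mean(numeric_only=True))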
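
Filling gaps with the column mean only makes sense for numeric columns, and recent pandas versions raise an error when `mean()` is asked to average a text column such as `species`. The sketch below works around this with `numeric_only=True` on a hypothetical three-row frame. Filling the text column with its most frequent value is one possible choice, shown here as an assumption and not as a recommendation from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing number and one missing label
demo = pd.DataFrame({
    "sepal_length": [5.0, np.nan, 7.0],
    "species": ["setosa", None, "virginica"],
})

# numeric_only=True skips the text column when computing the means
column_means = demo.mean(numeric_only=True)

# fillna with a Series fills each column named in the Series index
filled = demo.fillna(column_means)

# A text column can be filled with its most frequent value (the mode)
filled["species"] = filled["species"].fillna(filled["species"].mode()[0])

print(filled)
```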