## Data Preparation

In [15]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])

In [2]:
# Dataframe shape
iris_df.shape

(150, 4)

The DataFrame has 150 rows and 4 columns. Each row represents one datapoint and each column represents one feature, so the dataset contains 150 datapoints with 4 features each.
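As a small illustration of reading shape, the sketch below unpacks the tuple into a row count and a column count. The frame toy_df is a hypothetical stand-in built by hand, not the iris data.

```python
import pandas as pd

# Hypothetical stand-in frame: 3 datapoints (rows) with 2 features (columns)
toy_df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7],
    "sepal width (cm)": [3.5, 3.0, 3.2],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy_df.shape
print(f"{n_rows} datapoints with {n_cols} features each")
```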

In [3]:
# Dataframe columns
iris_df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

As we can see, there are four columns. The columns attribute returns the column names and nothing else, which makes it useful for identifying the features a dataset contains.
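One way to work with the column names, sketched here on a hypothetical two-column frame (toy_df is not part of the notebook), is to convert them to a list or test whether a given feature is present.

```python
import pandas as pd

# Hypothetical stand-in frame with two of the iris feature names
toy_df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "petal width (cm)": [0.2, 0.2],
})

# columns is an Index object; convert it to a plain list when needed
feature_names = toy_df.columns.tolist()

# Membership tests work directly on the Index
has_petal_width = "petal width (cm)" in toy_df.columns
has_species = "species" in toy_df.columns
```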

In [5]:
# Dataframe info
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


We can observe the following:

1. The data type of each column: in this dataset, every column is stored as 64-bit floating-point numbers.
2. The number of non-null values in each column: all 150 entries are non-null in every column, so this dataset has no missing values. Dealing with null values is an important step in data preparation and is covered later in the notebook.
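The two observations above can also be retrieved programmatically. The following is a minimal sketch on a hypothetical frame (toy_df, with one NaN inserted on purpose) rather than on the iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with one deliberately missing value
toy_df = pd.DataFrame({
    "sepal length (cm)": [5.1, np.nan, 4.7],
    "sepal width (cm)": [3.5, 3.0, 3.2],
})

# Data type of each column, as a Series indexed by column name
column_dtypes = toy_df.dtypes

# Non-null count per column (what info() reports as "Non-Null Count")
non_null_counts = toy_df.count()

# Null count per column
null_counts = toy_df.isnull().sum()
```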

In [6]:
# Dataframe describe
iris_df.describe()

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


The output above shows, for each column, the count of non-null values, the mean, the standard deviation, the minimum, the lower quartile (25%), the median (50%), the upper quartile (75%), and the maximum.
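To make the link between the summary rows and the underlying statistics concrete, here is a minimal sketch on a hypothetical five-value column (toy_df is an assumption for illustration). It reads values out of the describe() result and compares them with median() and quantile().

```python
import pandas as pd

# Hypothetical stand-in column with five evenly spaced values
toy_df = pd.DataFrame({"petal length (cm)": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = toy_df.describe()

# Each row of the summary can be looked up by its label
median_from_describe = summary.loc["50%", "petal length (cm)"]
lower_quartile_from_describe = summary.loc["25%", "petal length (cm)"]

# The same figures are available from dedicated methods
median_direct = toy_df["petal length (cm)"].median()
lower_quartile_direct = toy_df["petal length (cm)"].quantile(0.25)
```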

In [7]:
# Dataframe head
iris_df.head()

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


By default, head() returns the first five rows of the DataFrame. To see a different number of rows, use head(n), where n is the number of rows to return from the top.

In [8]:
# Dataframe tail
iris_df.tail()

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |


tail() returns the last five rows of the DataFrame. As with head(), you can request a specific number of rows with tail(n), where n is the number of rows to return from the bottom.
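The n argument of head and tail can be sketched on a hypothetical ten-row frame (toy_df is an assumption, chosen so the row labels make the selection obvious).

```python
import pandas as pd

# Hypothetical stand-in frame with rows labelled 0 through 9
toy_df = pd.DataFrame({"value": range(10)})

first_three = toy_df.head(3)   # rows 0, 1, 2
last_two = toy_df.tail(2)      # rows 8, 9
```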

### Handling Missing Values

In [17]:
example1 = pd.Series([0, np.nan, '', None])

example1.isnull()

0    False
1     True
2    False
3     True
dtype: bool

isnull() returns a Boolean mask that is True where a value is null and False otherwise. Note that only np.nan and None count as null here; 0 and the empty string do not.

In [19]:
example1.isnull().sum()

2

Because True counts as 1 and False as 0, summing the mask gives the number of null values in the Series.
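The mask can also be inverted and used for selection. The sketch below rebuilds the same Series under a different name (example is a local name chosen for illustration) and uses notnull(), the element-wise inverse of isnull().

```python
import numpy as np
import pandas as pd

example = pd.Series([0, np.nan, '', None])

null_mask = example.isnull()        # True where the value is NaN or None
not_null_mask = example.notnull()   # element-wise inverse of null_mask

null_count = int(null_mask.sum())   # True counts as 1, False as 0

# Boolean masks can also be used to select values
non_null_values = example[not_null_mask]
```

Selecting with not_null_mask keeps the 0 and the empty string, since neither is treated as null.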

### Dealing with missing values

In [22]:
example2 = pd.DataFrame([[1,      np.nan, 7], 
                         [2,      5,      8], 
                         [np.nan, 6,      9]])
example2

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | NaN | 7 |
| 1 | 2.0 | 5.0 | 8 |
| 2 | NaN | 6.0 | 9 |


In [23]:
example2.dropna()

|  | 0 | 1 | 2 |
|---|---|---|---|
| 1 | 2.0 | 5.0 | 8 |


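By default, dropna() drops every row that contains at least one NaN. Here only row 1 has no missing values, so it is the only row kept.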

In [25]:
example2.dropna(axis='columns')

|  | 2 |
|---|---|
| 0 | 7 |
| 1 | 8 |
| 2 | 9 |


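With axis='columns', dropna() drops every column that contains at least one NaN. Here only column 2 has no missing values, so it is the only column kept.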
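Dropping every row or column that has any NaN can discard a lot of data. As a sketch of some less aggressive options, the how, thresh, and subset parameters of dropna are shown below on the same values (the variable names are chosen here for illustration).

```python
import numpy as np
import pandas as pd

example = pd.DataFrame([[1,      np.nan, 7],
                        [2,      5,      8],
                        [np.nan, 6,      9]])

# how='all' drops a row only when every value in it is NaN;
# no row here is entirely NaN, so nothing is dropped
drop_all_nan_rows = example.dropna(how='all')

# thresh=3 keeps only rows with at least 3 non-null values
rows_fully_populated = example.dropna(thresh=3)

# subset restricts the NaN check to the listed columns
rows_with_col0 = example.dropna(subset=[0])
```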

### Filling null values

In [26]:
fill_with_mode = pd.DataFrame([[1,2,"True"],
                               [3,4,None],
                               [5,6,"False"],
                               [7,8,"True"],
                               [9,10,"True"]])

fill_with_mode

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | True |
| 1 | 3 | 4 | None |
| 2 | 5 | 6 | False |
| 3 | 7 | 8 | True |
| 4 | 9 | 10 | True |


In [28]:
fill_with_mode[2].isnull()

0    False
1     True
2    False
3    False
4    False
Name: 2, dtype: bool

In [29]:
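# 'True' is the most frequent value (the mode) in column 2.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not update the original DataFrame in newer pandas versions.
fill_with_mode[2] = fill_with_mode[2].fillna('True')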

In [30]:
fill_with_mode

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | True |
| 1 | 3 | 4 | True |
| 2 | 5 | 6 | False |
| 3 | 7 | 8 | True |
| 4 | 9 | 10 | True |
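The fill value above is written out by hand. A possible variation, sketched here on a fresh copy of the same values (df and most_common are names chosen for illustration), computes the mode with Series.mode() so the fill value follows the data.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "True"],
                   [3, 4, None],
                   [5, 6, "False"],
                   [7, 8, "True"],
                   [9, 10, "True"]])

# mode() ignores nulls and returns a Series (ties yield several values);
# take the first entry
most_common = df[2].mode()[0]

# Assign the filled column back instead of modifying it in place
df[2] = df[2].fillna(most_common)
```

If several values tie for most frequent, taking the first entry is an arbitrary choice and may need a more deliberate rule.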


In [31]:
fill_with_mean = pd.DataFrame([[-2,0,1],
                               [-1,2,3],
                               [np.nan,4,5],
                               [1,6,7],
                               [2,8,9]])

fill_with_mean

Unnamed: 0,0,1,2
0,-2.0,0,1
1,-1.0,2,3
2,,4,5
3,1.0,6,7
4,2.0,8,9
