Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source dataset from the web (e.g., https://www.kaggle.com). Provide a clear
 description of the data and its source (i.e., URL of the website).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: Check for missing values in the data using pandas isnull(), and use the describe()
function to get some initial statistics. Provide variable descriptions, types of variables, etc.
Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the code and outputs, explain every operation that you perform in the above steps and
explain everything that you do to import/read/scrape the data set.


In [2]:
# Import required libraries
import pandas as pd   # data manipulation and analysis
import numpy as np    # supports numerical operations
import seaborn as sns   # used for data visualization
import matplotlib.pyplot as plt    # used for data visualization

In [3]:
# Dataset: Iris Flower Dataset
# Source: https://www.kaggle.com/datasets/uciml/iris
# Description: Contains 150 samples with sepal and petal measurements for 3 iris species.

# Load the dataset
# Alternative (seaborn's bundled copy, which uses different column names):
# df = sns.load_dataset('iris')

df = pd.read_csv('Dataset/Iris.csv')   # read_csv loads the CSV file into a DataFrame
print(df)   # display the DataFrame

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0      1            5.1           3.5            1.4           0.2   
1      2            4.9           3.0            1.4           0.2   
2      3            4.7           3.2            1.3           0.2   
3      4            4.6           3.1            1.5           0.2   
4      5            5.0           3.6            1.4           0.2   
..   ...            ...           ...            ...           ...   
145  146            6.7           3.0            5.2           2.3   
146  147            6.3           2.5            5.0           1.9   
147  148            6.5           3.0            5.2           2.0   
148  149            6.2           3.4            5.4           2.3   
149  150            5.9           3.0            5.1           1.8   

            Species  
0       Iris-setosa  
1       Iris-setosa  
2       Iris-setosa  
3       Iris-setosa  
4       Iris-setosa  
..              ...  
145  
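
To illustrate what read_csv does during loading, here is a minimal sketch that parses the first three rows shown above from an in-memory string instead of a file. The variable names `csv_text` and `sample` are assumptions made for this illustration and are not part of the notebook.

```python
import io

import pandas as pd

# First three rows of the Iris data as shown in the output above, held in memory
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts a file path or any file-like object and infers column dtypes
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # inferred type of each column
```

The header row becomes the column names, and pandas infers an integer type for Id and a floating-point type for the measurement columns.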

Data Preprocessing Begins

In [4]:
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


In [5]:
df.tail()

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
145  146            6.7           3.0            5.2           2.3  Iris-virginica
146  147            6.3           2.5            5.0           1.9  Iris-virginica
147  148            6.5           3.0            5.2           2.0  Iris-virginica
148  149            6.2           3.4            5.4           2.3  Iris-virginica
149  150            5.9           3.0            5.1           1.8  Iris-virginica


In [6]:
# Check basic information
df.info()   # Displays the number of non-null entries, column names, and data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [7]:
#check dimensions of dataset
df.shape

(150, 6)

In [8]:
#summary statistics
df.describe()  # Gives statistical summaries of numeric columns

              Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count      150.0          150.0         150.0          150.0         150.0
mean        75.5       5.843333         3.054       3.758667      1.198667
std    43.445368       0.828066      0.433594        1.76442      0.763161
min          1.0            4.3           2.0            1.0           0.1
25%        38.25            5.1           2.8            1.6           0.3
50%         75.5            5.8           3.0           4.35           1.3
75%       112.75            6.4           3.3            5.1           1.8
max        150.0            7.9           4.4            6.9           2.5


In [9]:
# check for missing values
df.isnull().sum()   #Checks for missing values in each column and returns the count of nulls per column.

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
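
The Iris data has no missing values, so nothing needs to be imputed here. For a dataset that does have gaps, a minimal sketch of a common follow-up is shown below. The toy frame `demo`, its column names, and its values are made up for illustration, and median/mode filling is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (not the Iris data)
demo = pd.DataFrame({
    "length": [5.1, np.nan, 4.7, 4.6],
    "label": ["a", "b", None, "a"],
})

# Count missing values per column, as done above with isnull().sum()
missing_counts = demo.isnull().sum()

# Numeric column: fill the gap with the column median
demo["length"] = demo["length"].fillna(demo["length"].median())

# Categorical column: fill the gap with the most frequent value
demo["label"] = demo["label"].fillna(demo["label"].mode()[0])

print(missing_counts)
print(demo)
```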

Data Preprocessing Ends

In [10]:
#Check data types
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [11]:
#Convert 'Species' to categorical if not already
df['Species'] = df['Species'].astype('category')

In [12]:
df.dtypes

Id                  int64
SepalLengthCm     float64
SepalWidthCm      float64
PetalLengthCm     float64
PetalWidthCm      float64
Species          category
dtype: object
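
Iris loads with sensible types, so the only conversion needed is Species to category. When a numeric column arrives as text, the conversion could be sketched as follows. The frame `raw` and its columns are hypothetical and serve only to show the pandas calls.

```python
import pandas as pd

# Hypothetical frame where numbers were read in as strings
raw = pd.DataFrame({
    "width": ["3.5", "3.0", "n/a", "3.1"],
    "count": ["1", "2", "3", "4"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["width"] = pd.to_numeric(raw["width"], errors="coerce")

# Clean integer strings can be cast directly
raw["count"] = raw["count"].astype("int64")

print(raw.dtypes)
```

Coercing bad entries to NaN keeps the conversion from failing, but those rows then show up in isnull().sum() and need to be handled as missing values.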

In [13]:
# Scale the four measurement columns to a common [0, 1] range
from sklearn.preprocessing import MinMaxScaler

feature_cols = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# MinMaxScaler rescales each column to the range [0, 1]. Putting all features on the same scale matters for many ML algorithms, particularly distance-based ones.
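
With its default settings, min-max scaling computes x' = (x - min) / (max - min) for each column. The sketch below applies that formula by hand to a few SepalLengthCm values, using the column minimum (4.3) and maximum (7.9) reported by describe() above. The names `x` and `scaled` are assumptions for this illustration.

```python
import pandas as pd

# SepalLengthCm for rows 0 and 1, plus the column minimum (4.3) and maximum (7.9)
x = pd.Series([5.1, 4.9, 4.3, 7.9])

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (x - x.min()) / (x.max() - x.min())

print(scaled.round(6).tolist())   # prints [0.222222, 0.166667, 0.0, 1.0]
```

The first two results match the scaled SepalLengthCm values for rows 0 and 1 in the notebook output.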

In [14]:
print(df)

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0      1       0.222222      0.625000       0.067797      0.041667   
1      2       0.166667      0.416667       0.067797      0.041667   
2      3       0.111111      0.500000       0.050847      0.041667   
3      4       0.083333      0.458333       0.084746      0.041667   
4      5       0.194444      0.666667       0.067797      0.041667   
..   ...            ...           ...            ...           ...   
145  146       0.666667      0.416667       0.711864      0.916667   
146  147       0.555556      0.208333       0.677966      0.750000   
147  148       0.611111      0.416667       0.711864      0.791667   
148  149       0.527778      0.583333       0.745763      0.916667   
149  150       0.444444      0.416667       0.694915      0.708333   

            Species  
0       Iris-setosa  
1       Iris-setosa  
2       Iris-setosa  
3       Iris-setosa  
4       Iris-setosa  
..              ...  
145  

In [15]:
# Convert 'Species' to numerical codes (turns the categorical variable into a quantitative one)
df['encoded_species'] = df['Species'].cat.codes
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species  encoded_species
0   1       0.222222         0.625       0.067797      0.041667  Iris-setosa                0
1   2       0.166667      0.416667       0.067797      0.041667  Iris-setosa                0
2   3       0.111111           0.5       0.050847      0.041667  Iris-setosa                0
3   4       0.083333      0.458333       0.084746      0.041667  Iris-setosa                0
4   5       0.194444      0.666667       0.067797      0.041667  Iris-setosa                0


In [16]:
df.tail()

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species  encoded_species
145  146       0.666667      0.416667       0.711864      0.916667  Iris-virginica                2
146  147       0.555556      0.208333       0.677966          0.75  Iris-virginica                2
147  148       0.611111      0.416667       0.711864      0.791667  Iris-virginica                2
148  149       0.527778      0.583333       0.745763      0.916667  Iris-virginica                2
149  150       0.444444      0.416667       0.694915      0.708333  Iris-virginica                2
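
The integer codes are positions in the column's category list, which astype('category') builds in sorted order. That is why Iris-setosa maps to 0 and Iris-virginica to 2. A minimal sketch for recovering the code-to-label mapping is shown below; the series `species` is a small stand-in for the real column.

```python
import pandas as pd

# Small stand-in for the Species column, deliberately out of order
species = pd.Series(
    ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
).astype("category")

# Each code is an index into .cat.categories (sorted when the dtype is created)
code_to_label = dict(enumerate(species.cat.categories))
codes = species.cat.codes.tolist()

print(code_to_label)
print(codes)
```

Note that these codes impose an ordering (0 < 1 < 2) that has no meaning for species names, which is one reason one-hot encoding is often preferred for nominal variables.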


In [17]:
# One-hot encode the Species column
df = pd.get_dummies(df, columns=['Species'])

# Converts the Species column into one binary column per category using one-hot encoding: Species_Iris-setosa, Species_Iris-versicolor, Species_Iris-virginica.
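
In recent pandas versions get_dummies returns True/False columns. If 0/1 integers are preferred, the dtype argument can be passed, as in this sketch on a hypothetical three-row frame `toy`.

```python
import pandas as pd

# Hypothetical three-row frame with one row per species
toy = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]}
)

# dtype=int yields 0/1 columns instead of True/False
onehot = pd.get_dummies(toy, columns=["Species"], dtype=int)

print(onehot.columns.tolist())
print(onehot.values.tolist())
```

Each row has exactly one 1, in the column named after its species.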

In [18]:
df

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,encoded_species,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
0,1,0.222222,0.625000,0.067797,0.041667,0,True,False,False
1,2,0.166667,0.416667,0.067797,0.041667,0,True,False,False
2,3,0.111111,0.500000,0.050847,0.041667,0,True,False,False
3,4,0.083333,0.458333,0.084746,0.041667,0,True,False,False
4,5,0.194444,0.666667,0.067797,0.041667,0,True,False,False
...,...,...,...,...,...,...,...,...,...
145,146,0.666667,0.416667,0.711864,0.916667,2,False,False,True
146,147,0.555556,0.208333,0.677966,0.750000,2,False,False,True
147,148,0.611111,0.416667,0.711864,0.791667,2,False,False,True
148,149,0.527778,0.583333,0.745763,0.916667,2,False,False,True
