In this notebook, we will perform an initial exploratory data analysis (EDA) on a dataset loaded from a CSV file. We will explore the structure of the DataFrame, check for missing values, and get a general overview of the data.

In [45]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the dataset and perform initial exploration

In [46]:
# Load the dataset
df = pd.read_csv('datasets/data.csv')
# Show the number of rows and columns
print(f'Row/columns: {df.shape} \n'+'='*50)
# Display column titles
print(f'Column titles: {df.columns} \n'+'='*50)
# Display data types of each column
print(f'Data types:\n{df.dtypes} \n')

Row/columns: (32, 5) 
Column titles: Index(['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories'], dtype='object') 
Data types:
Duration      int64
Date         object
Pulse         int64
Maxpulse      int64
Calories    float64
dtype: object 



Convert the Date column from object (string) to datetime

In [47]:
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes["Date"])

datetime64[ns]
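
The conversion above raised no error, so every value in Date could be parsed. As a minimal sketch of how less tidy input could be handled, the example below uses a small made-up Series (the values and the `%Y/%m/%d` format are assumptions, not taken from the dataset) together with `errors='coerce'`, which turns unparseable entries into NaT instead of raising.

```python
import pandas as pd

# Hypothetical raw date strings; the last one is deliberately malformed
raw_dates = pd.Series(['2020/12/01', '2020/12/02', 'not a date'])

# An explicit format avoids ambiguous parsing; errors='coerce' maps failures to NaT
parsed = pd.to_datetime(raw_dates, format='%Y/%m/%d', errors='coerce')

print(parsed.dtype)
print(f'Unparseable entries: {parsed.isna().sum()}')
```

Counting the NaT values after conversion is a quick way to see how many rows need manual attention.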


Get summary information about the DataFrame

In [48]:
# Display summary information about the DataFrame
# df.info() prints its report directly and returns None, so it is called without print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Duration  32 non-null     int64         
 1   Date      32 non-null     datetime64[ns]
 2   Pulse     32 non-null     int64         
 3   Maxpulse  32 non-null     int64         
 4   Calories  30 non-null     float64       
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 1.4 KB
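
`DataFrame.info()` writes its report to standard output and returns `None`. If the report is needed as a string (for logging, say), one option is to pass a text buffer through the `buf` parameter. A minimal sketch on a made-up two-column frame (the values are illustrative, not from the dataset):

```python
import io
import pandas as pd

# Small illustrative frame; the values are made up
demo = pd.DataFrame({'Pulse': [110, 117], 'Calories': [409.1, None]})

buffer = io.StringIO()
result = demo.info(buf=buffer)   # the report goes into the buffer, not stdout
info_text = buffer.getvalue()

print(result)      # info() itself returns None
print(info_text)
```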



Overview of values in the DataFrame

In [49]:
# General overview of the DataFrame
# Display the first 5 rows of the DataFrame
print(f'First 5 rows:\n{df.head()}\n'+'='*50)
# Display the last 5 rows of the DataFrame
print(f'Last 5 rows:\n{df.tail()}\n'+'='*50)

First 5 rows:
   Duration       Date  Pulse  Maxpulse  Calories
0        60 2020-12-01    110       130     409.1
1        60 2020-12-02    117       145     479.0
2        60 2020-12-03    103       135     340.0
3        45 2020-12-04    109       175     282.4
4        45 2020-12-05    117       148     406.0
Last 5 rows:
    Duration       Date  Pulse  Maxpulse  Calories
27        60 2020-12-27     92       118     241.0
28        60 2020-12-28    103       132       NaN
29        60 2020-12-29    100       132     280.0
30        60 2020-12-30    102       129     380.3
31        60 2020-12-31     92       115     243.0


Missing values analysis

In [50]:
# General overview of missing values in the DataFrame
na_data = pd.DataFrame({'numb_na': df.isna().sum(),
                        'perc_na': (df.isna().sum() / df.shape[0] * 100).round(2)})
# Double quotes inside the f-string keep it valid on Python versions before 3.12
print(f'Missing values overview:\n{na_data.sort_values("perc_na", ascending=False)}\n'+'='*50)
# Clean a copy of the table by filling missing Calories values with the column mean
df_clean = df.copy()
df_clean['Calories'] = df_clean['Calories'].fillna(df_clean['Calories'].mean())
print(f'Missing values after cleaning:\n{df_clean.isna().sum()}')

Missing values overview:
          numb_na  perc_na
Calories        2     6.25
Duration        0     0.00
Date            0     0.00
Pulse           0     0.00
Maxpulse        0     0.00
Missing values after cleaning:
Duration    0
Date        0
Pulse       0
Maxpulse    0
Calories    0
dtype: int64
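
Filling with the mean is only one option. As a sketch under assumed data (the toy frame and the choice of median for Pulse are illustrative), `fillna` also accepts a dict that maps column names to fill values, which keeps the imputation scoped to the listed columns.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap in each column
toy = pd.DataFrame({
    'Pulse': [100.0, np.nan, 120.0],
    'Calories': [300.0, 400.0, np.nan],
})

# Map each column to its own fill value
fill_values = {'Pulse': toy['Pulse'].median(), 'Calories': toy['Calories'].mean()}
toy_filled = toy.fillna(fill_values)

print(toy_filled)
```

Columns that are not listed in the dict are left untouched, so a fill value computed for one column cannot leak into another.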


Statistical summary of a chosen column

In [51]:
# Double quotes inside the f-string keep it valid on Python versions before 3.12
print(f'Statistical summary of chosen column:\n{df["Calories"].describe()}\n'+'='*50)

Statistical summary of chosen column:
count     30.000000
mean     304.680000
std       66.003779
min      195.100000
25%      250.700000
50%      291.200000
75%      343.975000
max      479.000000
Name: Calories, dtype: float64
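
The figures reported by `describe()` can be reproduced with individual Series methods. In particular, `count` excludes missing values (which is why it shows 30 rather than 32 above), `std` is the sample standard deviation (`ddof=1`), and the 50% row is the median. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry
values = pd.Series([200.0, 250.0, 300.0, np.nan, 450.0])
summary = values.describe()

# count ignores NaN; std uses ddof=1; 50% is the median
print(summary['count'] == values.count())
print(np.isclose(summary['std'], values.std(ddof=1)))
print(summary['50%'] == values.median())
```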


Outlier detection using the IQR method

In [52]:
# Identify outliers in the 'Calories' column using the IQR method
q1 = df['Calories'].quantile(0.25)
q3 = df['Calories'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(f'IQR: {iqr:.2f} \nLower Bound: {lower_bound:.2f} \nUpper Bound: {upper_bound:.2f}\n'+'='*50)
# Find and display outliers
outliers = df[(df['Calories'] < lower_bound) | (df['Calories'] > upper_bound)]
print(f'Outliers in Calories column:\n{outliers}\n'+'='*50)

IQR: 93.28 
Lower Bound: 110.79 
Upper Bound: 483.89
Outliers in Calories column:
Empty DataFrame
Columns: [Duration, Date, Pulse, Maxpulse, Calories]
Index: []
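
The bounds follow directly from the quartiles reported earlier: with Q1 = 250.7 and Q3 = 343.975, the IQR is 93.275, so the fences are 250.7 - 1.5 * 93.275 = 110.7875 and 343.975 + 1.5 * 93.275 = 483.8875. Since the column's minimum (195.1) and maximum (479.0) both lie inside these fences, the empty result is expected. To reuse the same rule on other columns, the calculation could be wrapped in a helper. The sketch below is one possible form; `iqr_bounds` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious extreme
sample = pd.Series([10, 12, 11, 13, 12, 11, 100])
lower, upper = iqr_bounds(sample)
flagged = sample[(sample < lower) | (sample > upper)]

print(lower, upper)
print(flagged.tolist())
```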


Create a new single-column DataFrame with missing Calories values filled with the mean

In [53]:
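# Note: despite its name, this column holds the mean-imputed Calories values, not a True/False flag
is_calories_missing = pd.DataFrame({'is_calories_missing': df['Calories'].fillna(df['Calories'].mean())})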
print(f'Calories column with missing values filled:\n{is_calories_missing}\n'+'='*50)

Calories column with missing values filled:
    is_calories_missing
0                409.10
1                479.00
2                340.00
3                282.40
4                406.00
5                300.00
6                374.00
7                253.30
8                195.10
9                269.00
10               329.30
11               250.70
12               250.70
13               345.30
14               379.30
15               275.00
16               215.20
17               300.00
18               304.68
19               323.00
20               243.00
21               364.20
22               282.00
23               300.00
24               246.00
25               334.50
26               250.00
27               241.00
28               304.68
29               280.00
30               380.30
31               243.00
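
If the goal is a flag that records which rows were originally missing, as the name `is_calories_missing` suggests, a boolean indicator can be stored next to the imputed values. The sketch below uses a made-up frame, and the column name `Calories_filled` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value
toy = pd.DataFrame({'Calories': [300.0, np.nan, 500.0]})

# Record missingness before imputing, so the information is not lost
toy['is_calories_missing'] = toy['Calories'].isna()
toy['Calories_filled'] = toy['Calories'].fillna(toy['Calories'].mean())

print(toy)
```

Keeping the indicator makes it possible to check later whether the imputed rows behave differently from the observed ones.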


Calculate each pulse value's difference from the maximum pulse

In [54]:
pulse_range = pd.DataFrame({'pulse_range': (df['Pulse'].max() - df['Pulse'])})
print(f'Pulse range:\n{pulse_range}')

Pulse range:
    pulse_range
0            20
1            13
2            27
3            21
4            13
5            28
6            20
7            26
8            21
9            32
10           27
11           30
12           30
13           24
14           26
15           32
16           32
17           30
18           40
19           27
20           33
21           22
22           30
23            0
24           25
25           28
26           30
27           38
28           27
29           30
30           28
31           38
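
The column above holds each row's distance from the highest pulse in the dataset. The range of a column in the usual statistical sense is a single number, the maximum minus the minimum. A sketch with made-up pulse values that shows both quantities side by side:

```python
import pandas as pd

# Made-up pulse readings
pulse = pd.Series([110, 117, 103, 90, 130])

# Per-row distance from the maximum (the quantity computed in the cell above)
distance_from_max = pulse.max() - pulse

# Range in the statistical sense: a single scalar
pulse_range = pulse.max() - pulse.min()

print(distance_from_max.tolist())
print(pulse_range)
```

The largest per-row distance always equals the scalar range, because it belongs to the row holding the minimum.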


Duration category based on Duration values

In [55]:
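# Bin Duration into right-inclusive intervals: (0, 29] Short, (29, 61] Medium, (61, inf] Long
duration_category = pd.cut(df['Duration'], bins=[0, 29, 61, np.inf], labels=['Short', 'Medium', 'Long'])
# Insert the new column right after Duration (position 1)
df.insert(1, 'duration_category', duration_category)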
print(f'DataFrame with new duration_category column:\n{df.head()}\n', df.tail())

DataFrame with new duration_category column:
   Duration duration_category       Date  Pulse  Maxpulse  Calories
0        60            Medium 2020-12-01    110       130     409.1
1        60            Medium 2020-12-02    117       145     479.0
2        60            Medium 2020-12-03    103       135     340.0
3        45            Medium 2020-12-04    109       175     282.4
4        45            Medium 2020-12-05    117       148     406.0
     Duration duration_category       Date  Pulse  Maxpulse  Calories
27        60            Medium 2020-12-27     92       118     241.0
28        60            Medium 2020-12-28    103       132       NaN
29        60            Medium 2020-12-29    100       132     280.0
30        60            Medium 2020-12-30    102       129     380.3
31        60            Medium 2020-12-31     92       115     243.0
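
Because `pd.cut` uses right-inclusive intervals by default, the bin edges decide which category a boundary value falls into. The sketch below applies the same bins and labels to a few made-up durations chosen to sit on and around the edges.

```python
import numpy as np
import pandas as pd

# Made-up durations placed on and around the bin edges
durations = pd.Series([15, 29, 30, 60, 61, 62, 120])
categories = pd.cut(durations, bins=[0, 29, 61, np.inf],
                    labels=['Short', 'Medium', 'Long'])

print(pd.DataFrame({'Duration': durations, 'category': categories}))
print(categories.value_counts(sort=False))
```

Here 29 is still Short and 61 is still Medium. A duration of exactly 0 would fall outside the first bin and become NaN unless `include_lowest=True` is passed.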


Aggregate data by duration category

In [56]:
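# Pass observed explicitly: recent pandas versions warn when grouping on a categorical column without it
df_agg = df.groupby('duration_category', observed=False).agg({'Pulse': 'mean', 'Calories': 'mean'}).reset_index()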
print(f'Aggregated DataFrame:\n{df_agg}')

Aggregated DataFrame:
  duration_category  Pulse    Calories
0             Short  106.0  345.300000
1            Medium  103.4  305.064286
2              Long  104.0  253.300000
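
A group mean is easier to interpret when the group size is shown next to it. As a sketch on made-up data, named aggregation lets each output column get its own name and function, and the `observed` argument controls whether categories without any rows appear in the result. The output column names (`sessions`, `mean_pulse`, `mean_calories`) are illustrative choices.

```python
import pandas as pd

# Made-up frame; 'Long' is a declared category with no rows
toy = pd.DataFrame({
    'duration_category': pd.Categorical(
        ['Short', 'Medium', 'Medium', 'Medium'],
        categories=['Short', 'Medium', 'Long']),
    'Pulse': [100, 110, 120, 130],
    'Calories': [200.0, 300.0, None, 500.0],
})

# observed=True keeps only categories that actually occur in the data
summary = (
    toy.groupby('duration_category', observed=True)
       .agg(sessions=('Pulse', 'size'),
            mean_pulse=('Pulse', 'mean'),
            mean_calories=('Calories', 'mean'))
       .reset_index()
)
print(summary)
```

Note that the mean of Calories skips the missing value, so it is computed over fewer rows than `sessions` reports.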


