# **Data Loading and Initial Overview**

In this step, we first bring the data into the system and then take a basic look at it before doing any analysis.

**Data Loading**

Data loading means opening the data file and making it ready to use.
The dataset is stored in an Excel file, so it is imported into the system using a tool that can read Excel data. Once loaded, the data is arranged in a table format with rows and columns, similar to how it looks in Excel. This allows us to easily view, sort, and analyze the data.

**Checking Number of Rows and Columns**

After loading the data, we check how much data is present.


*   Rows show how many records or entries there are.
*   Columns show how many types of information are collected for each record.

This helps us understand the size of the dataset and whether it is large or small.

**Understanding the Type of Data**

Each column contains a specific type of information such as numbers, text, or dates.
By checking the data types, we make sure that:


*   Numbers are treated as numbers
*   Text is treated as text
*   Dates are handled correctly

This is important because different types of data are used in different ways during analysis.

**Viewing Sample Data**

We look at the first few and last few rows of the dataset to get a basic idea of what the data looks like.
This helps us:


*   Understand what kind of values are present
*   Check if the data has been loaded correctly
*   Identify any obvious errors or missing information

It is similar to quickly checking a few pages of a register before using it.

**Getting a Summary of the Data**

Finally, we generate a simple summary that shows:


*   Basic details about each column
*   Count of available values
*   Minimum, maximum, and average values for numerical data

This summary helps us understand the overall nature of the dataset and prepares us for further analysis.

**Why This Step Is Important**

The Data Loading and Initial Overview step is important because it helps us:


*   Confirm that the data is loaded correctly
*   Understand what information is available
*   Identify issues like missing or incorrect data early

Only after this step can we safely move on to deeper analysis and visualizations.


### **1. Importing Required Tools (Libraries)**

In [None]:
import pandas as pd
import numpy as np

Before working with data, we need some tools that help us read, organize, and analyze it easily.


*   **Pandas** is used to read and manage tabular datasets, such as the contents of Excel files.
*   **NumPy** helps with numerical calculations and handling numbers efficiently.


Think of this step like opening a calculator and a notebook before starting calculations.

### **2. Loading the Dataset into the System**

In [None]:
df = pd.read_excel("online_retail_dataset.xlsx")

Here, the dataset (which is stored in an Excel file) is loaded into the system so we can work with it.


*   By default, the first sheet of the Excel file is read.
*   The data is stored in a table-like structure called a DataFrame.
*   We name this table df for easy reference.


This is similar to opening an Excel sheet and keeping it ready for analysis.
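
As a minimal sketch, the loading step could be wrapped in a small helper that fails early with a clear message when the file is missing. The function name `load_retail_data` is hypothetical, and `sheet_name=0` (the first sheet) is an assumption about where the data lives. Note that `pd.read_excel` also needs an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def load_retail_data(path, sheet_name=0):
    """Read one sheet of an Excel file into a DataFrame.

    Hypothetical helper: checks that the file exists before
    handing it to pandas, so a wrong path is reported clearly.
    """
    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Dataset not found: {file_path}")
    # sheet_name=0 means "the first sheet" (an assumption here).
    return pd.read_excel(file_path, sheet_name=sheet_name)
```

It would be called as `df = load_retail_data("online_retail_dataset.xlsx")`.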

### **3. Viewing the First Few Records of the Data**

In [None]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Discount,PaymentMethod,ShippingCost,Category,SalesChannel,ReturnStatus,ShipmentProvider,WarehouseLocation,OrderPriority
0,221958,SKU_1964,White Mug,38,2020-01-01 00:00:00,1.71,37039.0,Australia,0.47,Bank Transfer,10.79,Apparel,In-store,Not Returned,UPS,London,Medium
1,771155,SKU_1241,White Mug,18,2020-01-01 01:00:00,41.25,19144.0,Spain,0.19,paypall,9.51,Electronics,Online,Not Returned,UPS,Rome,Medium
2,231932,SKU_1501,Headphones,49,2020-01-01 02:00:00,29.11,50472.0,Germany,0.35,Bank Transfer,23.03,Electronics,Online,Returned,UPS,Berlin,High
3,465838,SKU_1760,Desk Lamp,14,2020-01-01 03:00:00,76.68,96586.0,Netherlands,0.14,paypall,11.08,Accessories,Online,Not Returned,Royal Mail,Rome,Low
4,359178,SKU_1386,USB Cable,-30,2020-01-01 04:00:00,-68.11,,United Kingdom,1.501433,Bank Transfer,,Electronics,In-store,Not Returned,FedEx,,Medium


This step shows the first 5 rows of the dataset.


*   It helps us understand what kind of data is present.
*   We can see column names and sample values.
*   It confirms that the data has loaded correctly.


This is like glancing at the first page of a book to understand what it’s about.
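
The number of rows shown can be changed, and a random sample is sometimes more representative than the first rows. The sketch below uses a small stand-in table (three columns copied from the rows shown above) rather than the full dataset; the variable names are illustrative only.

```python
import pandas as pd

# Small stand-in table; values copied from the first rows shown above.
sample = pd.DataFrame({
    "InvoiceNo": [221958, 771155, 231932, 465838, 359178],
    "Description": ["White Mug", "White Mug", "Headphones",
                    "Desk Lamp", "USB Cable"],
    "Quantity": [38, 18, 49, 14, -30],
})

first_three = sample.head(3)   # first 3 rows instead of the default 5
last_two = sample.tail(2)      # last 2 rows
# 2 randomly chosen rows; random_state makes the choice repeatable.
random_two = sample.sample(n=2, random_state=0)

print(first_three)
```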

### **4. Viewing the Last Few Records of the Data**

In [None]:
df.tail()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Discount,PaymentMethod,ShippingCost,Category,SalesChannel,ReturnStatus,ShipmentProvider,WarehouseLocation,OrderPriority
49777,354083,SKU_1562,Blue Pen,25,2025-09-05 01:00:00,70.92,51445.0,Spain,0.2,Credit Card,8.96,Electronics,Online,Returned,UPS,Berlin,Medium
49778,296698,SKU_1930,USB Cable,7,2025-09-05 02:00:00,51.74,28879.0,United States,0.23,Bank Transfer,23.55,Electronics,Online,Not Returned,FedEx,Amsterdam,Low
49779,177622,SKU_1766,Office Chair,43,2025-09-05 03:00:00,85.25,21825.0,Portugal,0.2,Bank Transfer,16.26,Furniture,In-store,Not Returned,FedEx,London,High
49780,701213,SKU_1602,Notebook,48,2025-09-05 04:00:00,39.64,43199.0,United Kingdom,0.31,paypall,28.56,Apparel,Online,Not Returned,Royal Mail,London,Medium
49781,772215,SKU_1832,White Mug,30,2025-09-05 05:00:00,38.27,53328.0,France,0.1,Credit Card,9.13,Stationery,Online,Not Returned,UPS,Rome,Low


This displays the last 5 rows of the dataset.

*   It helps verify whether the data ends properly.
*   Useful for checking missing or incorrect values at the end.

This is similar to checking the last page of a register to ensure entries are complete.
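
Because the first and last rows carry the earliest and latest timestamps, one way to check that the data "ends properly" is to look at the date range and ordering directly. This is only a sketch on a stand-in column built from the first five timestamps shown earlier; whether the full dataset is actually sorted by InvoiceDate is an assumption that would need to be checked on the real data.

```python
import pandas as pd

# Stand-in column; timestamps copied from the first rows shown earlier.
sample = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2020-01-01 00:00:00",
        "2020-01-01 01:00:00",
        "2020-01-01 02:00:00",
        "2020-01-01 03:00:00",
        "2020-01-01 04:00:00",
    ])
})

start = sample["InvoiceDate"].min()   # earliest record
end = sample["InvoiceDate"].max()     # latest record
# True when every timestamp is >= the one before it.
in_order = sample["InvoiceDate"].is_monotonic_increasing

print(f"Records run from {start} to {end}; in order: {in_order}")
```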

### **5. Checking the Size of the Dataset (Rows and Columns)**

In [None]:
df.shape

(49782, 17)

This tells us how big the dataset is: 49,782 rows and 17 columns.

*   Number of rows → how many records or transactions
*   Number of columns → how many details are recorded for each entry
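
Since the shape is a plain tuple, it can be unpacked into two variables and reused later, for example in a printed message. A minimal sketch on a stand-in table (the names `n_rows` and `n_cols` are illustrative):

```python
import pandas as pd

# Small stand-in table with 5 rows and 3 columns.
sample = pd.DataFrame({
    "InvoiceNo": [221958, 771155, 231932, 465838, 359178],
    "Quantity": [38, 18, 49, 14, -30],
    "UnitPrice": [1.71, 41.25, 29.11, 76.68, -68.11],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = sample.shape
print(f"{n_rows:,} rows x {n_cols} columns")
```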


### **6. Getting a Complete Overview of the Dataset**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49782 entries, 0 to 49781
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   InvoiceNo          49782 non-null  int64         
 1   StockCode          49782 non-null  object        
 2   Description        49782 non-null  object        
 3   Quantity           49782 non-null  int64         
 4   InvoiceDate        49782 non-null  datetime64[ns]
 5   UnitPrice          49782 non-null  float64       
 6   CustomerID         44804 non-null  float64       
 7   Country            49782 non-null  object        
 8   Discount           49782 non-null  float64       
 9   PaymentMethod      49782 non-null  object        
 10  ShippingCost       47293 non-null  float64       
 11  Category           49782 non-null  object        
 12  SalesChannel       49782 non-null  object        
 13  ReturnStatus       49782 non-null  object        
 14  Shipme

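This provides a summary of the entire dataset.

It tells us:

*   Column names
*   Type of data in each column (number, text, date, etc.)
*   Whether any values are missing

In the output above, CustomerID has 44,804 non-null values and ShippingCost has 47,293, out of 49,782 rows, so both columns contain missing values. This step helps us understand the health of the data before analysis.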
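
The non-null counts can also be turned around into an explicit count of missing values per column. The sketch below does this on a small stand-in table in which the last row is missing CustomerID and ShippingCost, as in the fifth row shown earlier; the `report` table is an illustrative name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Stand-in table; the last row has missing CustomerID and ShippingCost.
sample = pd.DataFrame({
    "InvoiceNo": [221958, 771155, 231932, 465838, 359178],
    "CustomerID": [37039.0, 19144.0, 50472.0, 96586.0, np.nan],
    "ShippingCost": [10.79, 9.51, 23.03, 11.08, np.nan],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_count = sample.isna().sum()
# mean() of True/False values gives the missing fraction per column.
missing_pct = (sample.isna().mean() * 100).round(1)

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report)
```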

### **7. Listing All Column Names**

In [None]:
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country', 'Discount', 'PaymentMethod',
       'ShippingCost', 'Category', 'SalesChannel', 'ReturnStatus',
       'ShipmentProvider', 'WarehouseLocation', 'OrderPriority'],
      dtype='object')

This step displays the names of all columns in the dataset.



*   Helps us understand what information is available
*   Useful when selecting specific columns for analysis


It’s like reading the headings of a form to know what details are collected.
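
Once the column names are known, they can be used to confirm that the columns an analysis depends on are present and to select only those columns. A minimal sketch on a stand-in table; the `expected` list is an assumption chosen for illustration.

```python
import pandas as pd

# Small stand-in table using column names from the output above.
sample = pd.DataFrame({
    "InvoiceNo": [221958, 771155, 231932],
    "Description": ["White Mug", "White Mug", "Headphones"],
    "Quantity": [38, 18, 49],
    "UnitPrice": [1.71, 41.25, 29.11],
})

# Columns a later step might rely on (illustrative choice).
expected = ["InvoiceNo", "Quantity", "UnitPrice"]

# Report any expected column that is not in the table.
missing_cols = [col for col in expected if col not in sample.columns]
print("Missing columns:", missing_cols)

# Select only the expected columns, in the listed order.
subset = sample[expected]
```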

### **8. Checking the Data Type of Each Column**

In [None]:
df.dtypes

Unnamed: 0,0
InvoiceNo,int64
StockCode,object
Description,object
Quantity,int64
InvoiceDate,datetime64[ns]
UnitPrice,float64
CustomerID,float64
Country,object
Discount,float64
PaymentMethod,object


This shows the type of data stored in each column, such as:




*   Numbers
*   Text
*   Dates

Knowing data types is important to:

*   Perform correct calculations
*   Avoid errors during analysis

This is similar to knowing whether a field stores numbers, names, or dates.
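
Data types can also be used to group columns, and to adjust a type that does not match the meaning of the column. For example, CustomerID is stored as float64 because it contains missing values, even though IDs are whole numbers. The sketch below, on a stand-in table, converts it to pandas' nullable integer type `Int64`; whether this conversion is wanted for the real analysis is a judgment call, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Stand-in table; CustomerID is float64 because of the missing value.
sample = pd.DataFrame({
    "Quantity": [38, 18, 49, 14, -30],
    "UnitPrice": [1.71, 41.25, 29.11, 76.68, -68.11],
    "CustomerID": [37039.0, 19144.0, 50472.0, 96586.0, np.nan],
    "Country": ["Australia", "Spain", "Germany",
                "Netherlands", "United Kingdom"],
})

# Split the columns into numeric and non-numeric groups.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Nullable integer type: whole numbers, missing values kept as <NA>.
customer_id = sample["CustomerID"].astype("Int64")

print(numeric_cols, other_cols)
print(customer_id)
```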

### **9. Getting Statistical Summary of the Data**

In [None]:
df.describe()

Unnamed: 0,InvoiceNo,Quantity,InvoiceDate,UnitPrice,CustomerID,Discount,ShippingCost
count,49782.0,49782.0,49782,49782.0,44804.0,49782.0,47293.0
mean,550681.239946,22.372343,2022-11-03 02:30:00,47.537862,55032.871775,0.275748,17.494529
min,100005.0,-50.0,2020-01-01 00:00:00,-99.98,10001.0,0.0,5.0
25%,324543.0,11.0,2021-06-02 13:15:00,23.5925,32750.75,0.13,11.22
50%,552244.0,23.0,2022-11-03 02:30:00,48.92,55165.0,0.26,17.5
75%,776364.0,37.0,2024-04-04 15:45:00,74.61,77306.25,0.38,23.72
max,999997.0,49.0,2025-09-05 05:00:00,100.0,99998.0,1.999764,30.0
std,260703.009944,17.917774,,33.47951,25913.660157,0.230077,7.220557


This gives a numerical summary of the dataset.

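It shows:

*   Count of available values
*   Minimum and maximum values
*   Average (mean) and median (50%) values
*   Overall spread of numerical data (standard deviation and quartiles)

This helps us understand:

*   Price range
*   Quantity distribution
*   Possible outliers

In the output above, for example, the minimum Quantity (-50) and minimum UnitPrice (-99.98) are negative, and the maximum Discount (1.999764) is greater than 1, so these columns need a closer look before analysis.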

Think of this as a summary report that highlights important numerical insights.
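
Values that look suspicious in the summary can be pulled out as rows for inspection. The sketch below, on a stand-in table copied from the first rows shown earlier, flags rows with a negative Quantity, a negative UnitPrice, or a Discount above 1. Treating these as suspicious is an assumption: negative quantities, for instance, could represent returns rather than errors.

```python
import pandas as pd

# Stand-in table; values copied from the first rows shown earlier.
sample = pd.DataFrame({
    "Quantity": [38, 18, 49, 14, -30],
    "UnitPrice": [1.71, 41.25, 29.11, 76.68, -68.11],
    "Discount": [0.47, 0.19, 0.35, 0.14, 1.501433],
})

summary = sample.describe()

# Assumed rules: these conditions mark a row as worth a closer look.
is_suspicious = (
    (sample["Quantity"] < 0)
    | (sample["UnitPrice"] < 0)
    | (sample["Discount"] > 1)
)
suspicious = sample[is_suspicious]

print(summary.loc[["min", "max"]])
print(suspicious)
```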

## **In this file, the dataset is reloaded and required libraries are imported again to ensure independent execution before handling missing values.**