## Data Explorer
This notebook shows how to use the Data Explorer, a tool for data analysis and visualization. The workflow has five steps:

1. **Load Data**: Import datasets from sources such as CSV files, databases, and APIs. This notebook focuses on the **Magpie-Align/Magpie-Reasoning-150K** dataset, which has already been converted to Parquet format.
2. **Clean Data**: Handle missing values, remove duplicates, and apply any needed transformations.
3. **Explore Data**: Generate descriptive statistics, create visualizations, and look for patterns, outliers, anomalies, and missing values.
4. **Model Data**: Implement machine learning algorithms to develop predictive models.
5. **Share Insights**: Export analytical results and visualizations for reporting and collaboration.
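
As a rough illustration of steps 2 and 3, the sketch below cleans and summarizes a tiny in-memory table with pandas. It is not the `DataExplorer` implementation: the `clean` helper and the toy rows are invented for illustration, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "uuid": ["a", "b", "b", "c"],
        "instruction": ["Add 2 and 3", "Sort a list", "Sort a list", None],
        "difficulty": ["easy", "medium", "medium", "hard"],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop exact duplicate rows, then rows without an instruction."""
    deduplicated = df.drop_duplicates()
    return deduplicated.dropna(subset=["instruction"]).reset_index(drop=True)


# Step 2: clean the data.
cleaned = clean(toy)

# Step 3: simple exploration, counting missing values and category frequencies.
missing_per_column = toy.isna().sum()
difficulty_counts = cleaned["difficulty"].value_counts()

print(cleaned.shape)
print(missing_per_column)
print(difficulty_counts)
```

On the real dataset the same calls would run on the frame loaded from Parquet. Which columns to deduplicate on, and which missing values justify dropping a row, are decisions that depend on the analysis.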


### Step 1: Specify Dataset Path

First, define `dataset_paths`, a list of absolute paths to the datasets to explore. The list is passed to the `DataExplorer` class in the next step. Here it contains a single entry, the processed Magpie-Reasoning-150K dataset.

In [18]:
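import sys
import os

# Add the parent directory to the import path so the `src` package can be found.
sys.path.append(os.path.abspath(".."))

# Absolute paths of the datasets to explore (a single processed dataset here).
dataset_paths = [
    os.path.abspath('../data/processed/Magpie-Align/Magpie-Reasoning-150K'),
]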
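
Because a mistyped path only surfaces once loading starts, it can help to check the paths up front. The sketch below is one way to do that. `missing_paths` is a hypothetical helper, not part of `DataExplorer`, and the temporary directory only stands in for a real dataset location.

```python
import os
import tempfile


def missing_paths(paths):
    """Hypothetical helper: return the paths that do not exist on disk."""
    return [path for path in paths if not os.path.exists(path)]


# Demonstrate with one existing directory and one path that is known to be absent.
with tempfile.TemporaryDirectory() as existing_dir:
    candidate_paths = [existing_dir, os.path.join(existing_dir, "does-not-exist")]
    not_found = missing_paths(candidate_paths)

print(not_found)
```

In this notebook, `missing_paths(dataset_paths)` should return an empty list before the explorer is created.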

### Step 2: Initialize Data Explorer
Instantiate the `DataExplorer` class with `dataset_paths`, then call its `basic_info` method. For each dataset, the method displays:

- **Dataset Shape**: The number of rows and columns.
- **Columns**: The names of the columns.
- **Data Types**: The data type of each column.
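
The three items above correspond to standard pandas attributes. The sketch below shows how such a summary could be assembled, assuming the data is held in a pandas `DataFrame`. The source of `DataExplorer` is not shown in this notebook, so `summarize` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect shape, column names, and per-column dtypes."""
    return {
        "shape": df.shape,                            # (rows, columns)
        "columns": list(df.columns),                  # column names in order
        "dtypes": df.dtypes.astype(str).to_dict(),    # column name -> dtype name
    }


# Toy frame with invented values, used only to exercise the helper.
toy = pd.DataFrame({"uuid": ["a", "b"], "instruction": ["x", "y"]})
summary = summarize(toy)

print("Shape:", summary["shape"])
print("Columns:", summary["columns"])
print("Data Types:", summary["dtypes"])
```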

In [19]:
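# The parent directory was added to `sys.path` in the previous cell, so `src` is importable.
from src.preprocessing.data_exporation import DataExplorer

# Create the explorer from the paths defined in Step 1 and print the summary.
explorer = DataExplorer(dataset_paths)
explorer.basic_info()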

Dataset 1 Shape: (150000, 17)
Columns: ['uuid', 'instruction', 'response', 'conversations', 'gen_input_configs', 'gen_response_configs', 'intent', 'knowledge', 'difficulty', 'difficulty_generator', 'input_quality', 'quality_explanation', 'quality_generator', 'task_category', 'other_task_category', 'task_category_generator', 'language']
Data Types:
uuid                       object
instruction                object
response                   object
conversations              object
gen_input_configs          object
gen_response_configs       object
intent                     object
knowledge                  object
difficulty                 object
difficulty_generator       object
input_quality              object
quality_explanation        object
quality_generator          object
task_category              object
other_task_category        object
task_category_generator    object
language                   object
dtype: object
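
Every column is reported as `object`, which says little about what each column actually holds. One possible follow-up, sketched below on invented toy data, is to count distinct values per column and convert columns with repeated values to the pandas `category` dtype. The rule used here (fewer distinct values than rows) is an illustrative assumption, not something `DataExplorer` is known to do.

```python
import pandas as pd

# Toy frame with invented values; only the column names come from the dataset.
toy = pd.DataFrame(
    {
        "uuid": ["a", "b", "c", "d"],
        "difficulty": ["easy", "hard", "easy", "medium"],
        "language": ["EN", "EN", "EN", "EN"],
    }
)

# Number of distinct values in each column.
unique_counts = toy.nunique()

# Assumed heuristic: a column with repeated values is a candidate categorical.
low_cardinality = [col for col in toy.columns if unique_counts[col] < len(toy)]

# Convert the candidates to the `category` dtype and leave the rest unchanged.
typed = toy.astype({col: "category" for col in low_cardinality})

print(unique_counts)
print(typed.dtypes)
```

`nunique` needs hashable values, so any column that stores nested lists or dictionaries would have to be skipped or flattened first.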

