# Automatic EDA using `ydata_profiling`

Author: Tassawar Abbas

Email: abbas829@gmail.com

[github](https://github.com/abbas829)

# Introduction to ydata Profiling Library

ydata-profiling (imported as `ydata_profiling`, and formerly published as pandas-profiling) is a Python library for data profiling and exploratory data analysis (EDA). From a single pandas DataFrame it produces a report describing the structure, quality, and characteristics of the dataset, which makes it a useful tool for data scientists, analysts, and researchers.

## Uses of ydata Profiling

- **Data Exploration**: ydata profiling helps users explore their datasets by providing summary statistics, visualizations, and insights into the data distribution and relationships.
  
- **Data Quality Assessment**: It allows users to assess the quality of their data by identifying missing values, duplicates, outliers, and other common data quality issues.
  
- **Feature Engineering**: ydata profiling informs feature engineering by flagging highly correlated, skewed, constant, or high-cardinality variables and by highlighting patterns in the data. It does not create features itself.
  
- **Model Building**: It assists in the initial stages of model building by providing a comprehensive overview of the dataset, guiding feature selection, and identifying potential challenges and considerations for modeling.
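
As a rough point of comparison, the exploration and data-quality checks listed above can be done by hand with plain pandas. The sketch below runs on a small made-up DataFrame (the column names and values are illustrative assumptions, not the Titanic data). It shows the kind of checks a profiling report bundles together, not how ydata profiling computes them internally.

```python
import pandas as pd

# Small made-up frame standing in for a real dataset (illustrative values only).
df = pd.DataFrame(
    {
        "age": [22.0, 38.0, None, 35.0, 22.0],
        "fare": [7.25, 71.28, 8.05, 53.10, 7.25],
        "embarked": ["S", "C", "S", None, "S"],
    }
)

# Missing values per column.
missing_counts = df.isna().sum()

# Fully duplicated rows (the first occurrence is not counted as a duplicate).
duplicate_rows = int(df.duplicated().sum())

# Summary statistics for the numeric columns.
numeric_summary = df.describe()

# Frequency counts for a categorical column, keeping missing entries visible.
embarked_counts = df["embarked"].value_counts(dropna=False)

print(missing_counts, duplicate_rows, numeric_summary, embarked_counts, sep="\n\n")
```

Each of these checks has to be written and repeated per column, which is the manual effort an automated report is meant to save.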

## Features of ydata Profiling

- **Summary Statistics**: ydata profiling generates summary statistics such as mean, median, mode, standard deviation, quartiles, and frequency counts for numerical and categorical variables.

- **Data Visualization**: The report includes histograms, bar charts of category frequencies, interaction scatter plots, correlation heatmaps, and missing-value diagrams to show the data distribution and relationships.

- **Data Quality Assessment**: ydata profiling identifies common data quality issues such as missing values, duplicates, outliers, and inconsistencies, helping users clean and preprocess their data.

- **Interactive Reports**: It generates an interactive HTML report (or a notebook widget) with per-variable details, visualizations, and alerts that can be easily shared with and explored by stakeholders.

- **Customization**: Users can customize the analysis and reports according to their specific requirements and preferences, including selecting specific analyses and visualizations to include or exclude.
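
A minimal sketch of the customization point, assuming the `title` and `minimal` keyword arguments of `ProfileReport`: `minimal=True` selects a lighter preset that skips the more expensive computations, which helps on larger datasets. The helper name `build_minimal_report` and the `REPORT_OPTIONS` dictionary are hypothetical names introduced for illustration only.

```python
import pandas as pd

# Keyword arguments forwarded to ProfileReport (names chosen for this sketch).
REPORT_OPTIONS = {
    "title": "Titanic Profiling Report",
    "minimal": True,  # lighter preset that turns off the expensive analyses
}


def build_minimal_report(df: pd.DataFrame):
    """Build a profile report for `df` using the options above."""
    # Imported lazily so the heavy dependency is only loaded when needed.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, **REPORT_OPTIONS)
```

Calling `build_minimal_report(df).to_file("report.html")` would then write the reduced report; the output file name here is only an example.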

## Benefits of Using ydata Profiling

- **Efficiency**: ydata profiling automates and streamlines the data profiling and EDA process, saving time and effort compared to manual exploration and analysis.

- **Insights**: It provides valuable insights into the dataset, helping users better understand the data and make informed decisions in their data analysis and modeling tasks.

- **Quality Assurance**: ydata profiling helps ensure data quality and integrity by identifying and addressing common data quality issues early in the analysis process.

- **Collaboration**: The interactive reports generated by ydata profiling facilitate collaboration and communication among team members and stakeholders, enabling them to explore and discuss the data together.

## Conclusion

ydata profiling is a versatile and powerful tool for data exploration and analysis, offering a range of features and benefits to users. By providing comprehensive insights into datasets and identifying potential issues and patterns, ydata profiling enables users to make better decisions and derive more value from their data.

Whether you're a data scientist, analyst, or researcher, ydata profiling can enhance your data analysis workflow and help you uncover actionable insights from your datasets.



In [1]:
# import libraries
import pandas as pd
import seaborn as sns
import ydata_profiling as yd



In [2]:
# load the Titanic example dataset via seaborn (downloaded on first use)
df = sns.load_dataset('titanic')

In [4]:
# build the profile report and save it as an HTML file
# (make sure the ./data directory exists before running this cell)
profile = yd.ProfileReport(df)
profile.to_file(output_file='./data/ydata_titanic.html')

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'no'')
(using `df.profile_report(missing_diagrams={"Heatmap": False}`)
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: '--'')
Summarize dataset: 100%|██████████| 41/41 [00:05<00:00,  7.59it/s, Completed]               
Generate report structure: 100%|██████████| 1/1 [00:06<00:00,  6.99s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.70s/it]
Export report to file: 100%|██████████| 1/1 [00:00<?, ?it/s]
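
The two warnings in the output above suggest switching off the calculations that failed (the automatic correlation and the missing-values heatmap). A hedged sketch of how those hints might be applied when building the report follows. The `correlations` override is copied from the warning text; the lowercase `"heatmap"` key is an assumption about the settings schema (the warning prints `"Heatmap"`), so check it against your installed version. The names `QUIET_OVERRIDES` and `build_quiet_report` are hypothetical.

```python
import pandas as pd

# Overrides based on the hints printed in the warnings.
QUIET_OVERRIDES = {
    # Skip the automatic correlation that failed on string-valued columns.
    "correlations": {"auto": {"calculate": False}},
    # Skip the missing-values heatmap; the lowercase key is an assumption.
    "missing_diagrams": {"heatmap": False},
}


def build_quiet_report(df: pd.DataFrame, title: str = "Titanic Profiling Report"):
    """Build a report with the failing calculations disabled."""
    # Imported lazily so the heavy dependency is only loaded when needed.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, **QUIET_OVERRIDES)
```

Disabling a calculation only hides the warning and removes that section from the report. If the correlation or heatmap matters for the analysis, the underlying string-to-float conversion error is worth investigating instead.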
