Written mostly by GPT-4, this R package renders an EDA report based on this R Markdown file. It can generate an EDA report like this from any data set. You can also generate the report using the Shiny app RTutor. For questions or feedback, contact Steven Ge.
library("remotes")
install_github("gexijin/inspect")
library(inspect)
eda(mtcars) # Generate an EDA report for a data frame, e.g. mtcars
eda(iris, "Species") # Specifying a dependent/target variable
Exploratory data analysis (EDA) is an essential first step in any data science project. Consider it the equivalent of an annual doctor’s check-up, but for data science projects. I have long believed that EDA can be automated, as the tasks are very general. While there are existing R packages for EDA such as DataExplorer, summarytools, tableone, and GGally, none of them did quite what I was looking for. Leveraging GPT-4, I was able to create an EDA script in just a few hours.
Given a data set, the main idea is to streamline these steps:
- Start with a data summary.
- Check for missing values and outliers.
- Plot the distribution of numerical variables using histograms and QQ plots. When excessive skewness is present, a log transformation is recommended.
- Plot the distribution of categorical variables.
- Provide a general data overview with a heatmap and a correlation plot.
- Compute a correlation matrix (corrplot).
- Draw scatter plots to examine correlations between numerical variables.
- Use violin plots and ANOVA to study the differences between groups defined by categorical variables.
- Test whether categorical variables are independent of each other, using Chi-squared tests and bar plots.
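To make a few of these steps concrete, here is a minimal base-R sketch of the numerical checks (missing values, outliers, skewness, ANOVA, and the Chi-squared test). It is not the package's actual implementation: the helper names `count_outliers` and `skewness`, the 1.5 × IQR outlier rule, and the skewness cutoff of 1 are all assumptions chosen for illustration, and the plots are left out.

```r
# Illustrative sketch only; helper names and thresholds are assumptions,
# not taken from the inspect package.
df <- iris
numeric_cols <- names(df)[sapply(df, is.numeric)]

# Data summary
print(summary(df))

# Missing values per column
missing_counts <- colSums(is.na(df))

# Outliers per numeric column, using the 1.5 * IQR rule (an assumed rule)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}
outlier_counts <- sapply(df[numeric_cols], count_outliers)

# Moment-based skewness; flag strictly positive, right-skewed columns
# as candidates for a log transformation (cutoff of 1 is an assumption)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew <- sapply(df[numeric_cols], skewness)
all_positive <- sapply(df[numeric_cols], function(x) all(x > 0, na.rm = TRUE))
log_candidates <- names(skew)[skew > 1 & all_positive]

# ANOVA: does a numeric variable differ between groups of the target?
fit <- aov(Sepal.Length ~ Species, data = df)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Chi-squared test of independence between two categorical variables.
# Small expected counts trigger an approximation warning, silenced here.
tab <- table(mtcars$cyl, mtcars$am)
chi <- suppressWarnings(chisq.test(tab))
```

Inspecting `missing_counts`, `outlier_counts`, `skew`, `anova_p`, and `chi$p.value` gives a rough numerical counterpart to what the report shows graphically.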
To use this R Markdown file, you just need to obtain a copy from this GitHub repository. Replace the demo data file with your own, specify a target variable, and you’re ready to render the report.
If that sounds like too much work, simply upload your data file to RTutor.ai, and click on the EDA tab. A comprehensive report will be generated in 2 minutes. The template was originally written for RTutor.