AutoInsight is an automated data analysis tool designed to analyze CSV files containing any number of columns and rows. The tool generates a variety of insights, including summary statistics, correlation analysis, visualizations of distributions, outlier detection, and Principal Component Analysis (PCA). AutoInsight aims to provide a flexible and comprehensive platform for users to quickly understand their data.
- Data Loading: Reads CSV files with an unknown structure.
- Summary Statistics: Provides descriptive statistics for each column.
- Missing Values Handling: Identifies and imputes missing values using mean for numeric columns and mode for categorical columns.
- Correlation Analysis: Computes and visualizes correlations among numeric columns.
- Visualizations: Generates histograms, boxplots, and bar plots for a variety of data insights.
- Outlier Detection: Identifies outliers using the Interquartile Range (IQR) method.
- PCA Visualization: Conducts PCA for dimensionality reduction and visualizes the results.
To run AutoInsight, ensure that R is installed on your system. You will need the following R packages:
install.packages(c("ggplot2", "dplyr", "corrplot", "reshape2", "gridExtra", "FactoMineR", "factoextra"))
- Load Necessary Libraries: The following libraries are loaded at the beginning of the script:
library(ggplot2)
library(dplyr)
library(corrplot)
library(reshape2)
library(gridExtra)
library(FactoMineR)
library(factoextra)
- Function Definitions:
  - getmode: A utility function that computes the mode of a vector.
  - analyze_csv_data: The main function for performing data analysis.
- Function Arguments:
  - file_path: A string that specifies the path to the CSV file to be analyzed.
- Example Usage:
file_path <- "/kaggle/input/data-csv/Housing.csv" # Replace with your file path
analyze_csv_data(file_path)
The function begins by reading the specified CSV file:
data <- read.csv(file_path, stringsAsFactors = FALSE)
It prints the first few rows of the data and displays the data types:
print(head(data))
print("Data Types:")
print(sapply(data, class))
The function calculates and prints summary statistics for all columns:
print("Summary Statistics:")
print(summary(data))
The function checks for missing values and displays the count of missing entries in each column:
print("Missing Values:")
print(colSums(is.na(data)))
Numeric columns are imputed with the mean, while categorical columns are filled with the mode:
# Replace missing numeric values with the column mean
for (col in numeric_cols) {
  data[[col]][is.na(data[[col]])] <- mean(data[[col]], na.rm = TRUE)
}
# Replace missing categorical values with the most frequent value
for (col in categorical_cols) {
  data[[col]][is.na(data[[col]])] <- as.character(getmode(data[[col]]))
}
The function calculates the correlation matrix for numeric columns and visualizes it using corrplot and a heatmap:
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black", addCoef.col = "grey")
heatmap(correlation_matrix, main = "Heatmap of Correlations", col = heat.colors(256), margins = c(10, 10))
The function generates histograms and boxplots for each numeric column, arranged in a grid:
ggplot(numeric_data, aes_string(x = col)) +
  geom_histogram(...) # for histograms
ggplot(numeric_data, aes_string(y = col)) +
  geom_boxplot(...) # for boxplots
If there are multiple numeric columns, it creates pairwise scatter plots:
pairs(numeric_data, main = "Pairwise Plot of Numeric Variables")
The function creates bar plots for categorical variables:
ggplot(data, aes_string(x = col)) +
  geom_bar(...)
It identifies outliers in numeric columns using the IQR method and prints them:
Q1 <- quantile(numeric_data[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(numeric_data[[col]], 0.75, na.rm = TRUE)
If applicable, PCA is performed for dimensionality reduction, and the results are visualized:
pca_res <- PCA(numeric_data, graph = FALSE)
fviz_pca_ind(pca_res, ...)
AutoInsight automates a standard first pass over CSV data: summary statistics, missing-value imputation, correlation analysis, distribution plots, IQR-based outlier detection, and PCA. It can be extended with additional analyses or customized to suit a particular dataset, and is aimed at anyone who wants a quick overview of a dataset without writing the exploratory code by hand.
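
The imputation excerpt in the walkthrough uses getmode, numeric_cols, and categorical_cols without showing how they are defined. The following is a minimal, self-contained sketch of that step, not the script's actual code: the body of getmode, the way the column groups are derived, and the toy data frame are all assumptions made for illustration.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# The README names getmode but does not show its body, so this is an assumption.
# Note: a column that is entirely NA has no mode and would need separate handling.
getmode <- function(v) {
  v <- v[!is.na(v)]
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Toy data frame standing in for a loaded CSV (values made up for illustration).
data <- data.frame(
  price = c(100, NA, 300),
  area  = c(50, 60, NA),
  city  = c("A", NA, "A"),
  stringsAsFactors = FALSE
)

# Assumed derivation of the column groups used by the imputation loops.
numeric_cols <- names(data)[sapply(data, is.numeric)]
categorical_cols <- names(data)[sapply(data, function(x) is.character(x) || is.factor(x))]

# Numeric columns: fill missing entries with the column mean.
for (col in numeric_cols) {
  data[[col]][is.na(data[[col]])] <- mean(data[[col]], na.rm = TRUE)
}
# Categorical columns: fill missing entries with the most frequent value.
for (col in categorical_cols) {
  data[[col]][is.na(data[[col]])] <- as.character(getmode(data[[col]]))
}

print(data)
```

When several values tie for most frequent, this version of getmode returns the one that appears first in the column.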

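
The outlier excerpt in the walkthrough stops after computing Q1 and Q3. A sketch of how the IQR rule is usually completed follows. The function name find_iqr_outliers and the multiplier k = 1.5 are assumptions; 1.5 is the conventional choice, but the README does not state which value the script uses.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR].
# k = 1.5 is the conventional default and an assumption here.
find_iqr_outliers <- function(x, k = 1.5) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - k * iqr
  upper <- q3 + k * iqr
  # Keep only non-missing values that fall outside the fences.
  x[!is.na(x) & (x < lower | x > upper)]
}

# Made-up values: five clustered points and one far from the rest.
values <- c(10, 12, 11, 13, 12, 95)
outliers <- find_iqr_outliers(values)
print(outliers)
```

Applying the function to each numeric column, for example with lapply over the numeric part of the data frame, would reproduce the per-column reporting described in the walkthrough.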

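
The README says PCA runs "if applicable" without stating the condition. One plausible check, sketched below, is that at least two non-constant numeric columns and at least two complete rows remain. The helper pca_ready is hypothetical, and base R's prcomp stands in for FactoMineR's PCA so the sketch runs without extra packages; it is not the call the script makes.

```r
# Hypothetical guard: return the numeric, complete, non-constant part of a
# data frame if it is usable for PCA, otherwise NULL.
pca_ready <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  if (ncol(num) < 2) return(NULL)
  num <- num[complete.cases(num), , drop = FALSE]
  # Constant columns cannot be scaled to unit variance, so drop them.
  varying <- vapply(num, function(x) length(unique(x)) > 1, logical(1))
  num <- num[, varying, drop = FALSE]
  if (ncol(num) >= 2 && nrow(num) >= 2) num else NULL
}

# Toy data (made up): two informative numeric columns, one constant, one text.
toy <- data.frame(
  price  = c(100, 200, 300, 400),
  area   = c(50, 65, 70, 90),
  floors = c(1, 1, 1, 1),
  city   = c("A", "B", "A", "B")
)

pca_input <- pca_ready(toy)
if (!is.null(pca_input)) {
  # Centered and scaled PCA using base R as a stand-in for FactoMineR::PCA.
  pca_base <- prcomp(pca_input, center = TRUE, scale. = TRUE)
  print(summary(pca_base))
}
```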







