codeterrayt/AutoInsight

AutoInsight

Project Overview

AutoInsight is an automated data analysis tool, written in R, for CSV files with any number of columns and rows. It generates summary statistics, correlation analysis, distribution plots, outlier detection, and Principal Component Analysis (PCA), so that users can get a quick first understanding of an unfamiliar dataset.

Features

  • Data Loading: Reads CSV files with an unknown structure.
  • Summary Statistics: Provides descriptive statistics for each column.
  • Missing Values Handling: Identifies missing values and imputes them, using the mean for numeric columns and the mode for categorical columns.
  • Correlation Analysis: Computes and visualizes correlations among numeric columns.
  • Visualizations: Generates histograms, boxplots, and bar plots for a variety of data insights.
  • Outlier Detection: Identifies outliers using the Interquartile Range (IQR) method.
  • PCA Visualization: Conducts PCA for dimensionality reduction and visualizes the results.

Outputs

[Ten example output images]

Installation

To run AutoInsight, ensure that R is installed on your system. You will need the following R packages:

install.packages(c("ggplot2", "dplyr", "corrplot", "reshape2", "gridExtra", "FactoMineR", "factoextra"))

Usage

  1. Load Necessary Libraries: The following libraries are loaded at the beginning of the script:
library(ggplot2)
library(dplyr)
library(corrplot)
library(reshape2)
library(gridExtra)
library(FactoMineR)
library(factoextra)
  2. Function Definitions:

    • getmode: A utility function that computes the mode of a vector.
    • analyze_csv_data: The main function for performing data analysis.
  3. Function Arguments:

    • file_path: A string that specifies the path to the CSV file to be analyzed.
  4. Example Usage:

file_path <- "/kaggle/input/data-csv/Housing.csv"  # Replace with your file path
analyze_csv_data(file_path)

Detailed Functionality

1. Read CSV Data

The function begins by reading the specified CSV file:

data <- read.csv(file_path, stringsAsFactors = FALSE)

2. Display Initial Data

It prints the first few rows of the data and displays the data types:

print(head(data))
print("Data Types:")
print(sapply(data, class))
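The loading and inspection steps can be tried without a real dataset. The sketch below is only an illustration: the temporary file, its contents, and the names csv_path, sample_data, and col_classes are made up for the example. It also passes na.strings, which is an assumption not taken from the script; without it, empty cells in text columns are read as empty strings rather than NA and would not be counted by is.na().

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "price,area,furnishing",
  "250000,1200,furnished",
  "180000,,unfurnished",
  "320000,1500,"
), csv_path)

# stringsAsFactors = FALSE keeps text columns as character vectors.
# na.strings is an added assumption: it makes empty text cells read as NA.
sample_data <- read.csv(csv_path, stringsAsFactors = FALSE,
                        na.strings = c("", "NA"))

# Same inspection steps as in the README, on the toy data.
print(head(sample_data))
col_classes <- sapply(sample_data, class)
print(col_classes)
```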

3. Summary Statistics

The function calculates and prints summary statistics for all columns:

print("Summary Statistics:")
print(summary(data))

4. Missing Values Analysis

The function checks for missing values and displays the count of missing entries in each column:

print("Missing Values:")
print(colSums(is.na(data)))

5. Imputation of Missing Values

Numeric columns are imputed with the mean, while categorical columns are filled with the mode:

for (col in numeric_cols) {
    data[[col]][is.na(data[[col]])] <- mean(data[[col]], na.rm = TRUE)
}
for (col in categorical_cols) {
    data[[col]][is.na(data[[col]])] <- as.character(getmode(data[[col]]))
}
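The excerpt above relies on getmode, numeric_cols, and categorical_cols, none of which are shown in this README. The following is a minimal sketch of how the pieces might fit together. The getmode body, the type-based column selection, and the toy data frame are all assumptions made for illustration, not the script's actual definitions.

```r
# Hypothetical mode helper: the most frequent non-missing value.
getmode <- function(v) {
  v <- v[!is.na(v)]
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Toy data with one missing value per column.
toy <- data.frame(
  price = c(100, NA, 300, 200),
  city  = c("Pune", "Delhi", NA, "Pune"),
  stringsAsFactors = FALSE
)

# Assumed definitions of the column groups, based on column type.
numeric_cols     <- names(toy)[sapply(toy, is.numeric)]
categorical_cols <- names(toy)[sapply(toy, is.character)]

print(colSums(is.na(toy)))  # missing counts before imputation

# Same imputation loops as in the excerpt, applied to the toy data.
for (col in numeric_cols) {
  toy[[col]][is.na(toy[[col]])] <- mean(toy[[col]], na.rm = TRUE)
}
for (col in categorical_cols) {
  toy[[col]][is.na(toy[[col]])] <- as.character(getmode(toy[[col]]))
}

print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is best treated as a quick default rather than a general-purpose strategy.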

6. Correlation Analysis

The function calculates the correlation matrix for numeric columns and visualizes it using corrplot and a heatmap:

correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black", addCoef.col = "grey")
heatmap(correlation_matrix, main = "Heatmap of Correlations", col = heat.colors(256), margins = c(10, 10))
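The snippet above assumes numeric_data already exists. As a rough sketch, it could be built by keeping only the numeric columns of the data frame. The example below uses R's built-in iris dataset and injects two missing values (both choices are assumptions for illustration) to show what use = "pairwise.complete.obs" does: each pair of columns is correlated over the rows where both values are present.

```r
# Keep only numeric columns; iris also has a factor column (Species).
numeric_data <- iris[sapply(iris, is.numeric)]

# Inject missing values to illustrate pairwise-complete handling.
numeric_data$Sepal.Length[c(2, 5)] <- NA

# Each pairwise correlation uses all rows where both columns are non-missing,
# so pairs that do not involve Sepal.Length still use all 150 rows.
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
print(round(correlation_matrix, 2))
```

The resulting matrix can be passed to corrplot() or heatmap() exactly as in the excerpt above.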

7. Histograms and Boxplots

The function generates histograms and boxplots for each numeric column, arranged in a grid:

ggplot(numeric_data, aes_string(x = col)) +
  geom_histogram(...)  # for histograms
ggplot(numeric_data, aes_string(y = col)) +
  geom_boxplot(...)    # for boxplots
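A runnable version of this step might look like the sketch below. It is an illustration under several assumptions: it uses the built-in iris dataset, the bin count and colors are arbitrary, and it maps columns with the .data pronoun instead of aes_string(), which recent ggplot2 releases mark as deprecated. The grid layout is drawn only if gridExtra is installed.

```r
library(ggplot2)

# Numeric columns of a built-in dataset stand in for the user's CSV.
numeric_data <- iris[sapply(iris, is.numeric)]

# One histogram per numeric column; .data[[col]] looks the column up by name.
hist_plots <- lapply(names(numeric_data), function(col) {
  ggplot(numeric_data, aes(x = .data[[col]])) +
    geom_histogram(bins = 20, fill = "steelblue", color = "white") +
    labs(title = paste("Histogram of", col), x = col)
})

# One boxplot per numeric column.
box_plots <- lapply(names(numeric_data), function(col) {
  ggplot(numeric_data, aes(y = .data[[col]])) +
    geom_boxplot(fill = "tomato") +
    labs(title = paste("Boxplot of", col), y = col)
})

# Arrange all plots in a grid: histograms on top, boxplots below.
if (requireNamespace("gridExtra", quietly = TRUE)) {
  gridExtra::grid.arrange(grobs = c(hist_plots, box_plots), ncol = 4)
}
```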

8. Pairwise Scatter Plots

If there are multiple numeric columns, it creates pairwise scatter plots:

pairs(numeric_data, main = "Pairwise Plot of Numeric Variables")

9. Categorical Variables Visualization

The function creates bar plots for categorical variables:

ggplot(data, aes_string(x = col)) +
  geom_bar(...)

10. Outlier Detection

It identifies outliers in numeric columns using the IQR method and prints them. The excerpt below shows only the quartile computation:

Q1 <- quantile(numeric_data[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(numeric_data[[col]], 0.75, na.rm = TRUE)
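The quartiles are only the first half of the IQR method. A minimal sketch of the full rule follows, assuming the conventional 1.5 × IQR fences; the multiplier, the helper name find_iqr_outliers, and the toy data are assumptions, since the README does not show this part of the script.

```r
# Hypothetical helper: return the values of x lying outside the IQR fences.
find_iqr_outliers <- function(x, k = 1.5) {
  q1 <- unname(quantile(x, 0.25, na.rm = TRUE))
  q3 <- unname(quantile(x, 0.75, na.rm = TRUE))
  iqr <- q3 - q1
  lower <- q1 - k * iqr   # lower fence
  upper <- q3 + k * iqr   # upper fence
  x[!is.na(x) & (x < lower | x > upper)]
}

# Toy data: one obviously extreme value in the area column.
numeric_data <- data.frame(
  area  = c(50, 52, 55, 58, 60, 61, 63, 250),
  rooms = c(2, 3, 3, 2, 4, 3, 2, 3)
)

# Apply the rule column by column and print the flagged values.
outliers <- lapply(numeric_data, find_iqr_outliers)
print(outliers)
```

Here only the value 250 in area falls outside its fences; rooms has no flagged values.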

11. Principal Component Analysis (PCA)

If applicable, PCA is performed for dimensionality reduction, and the results are visualized:

pca_res <- PCA(numeric_data, graph = FALSE)
fviz_pca_ind(pca_res, ...)
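To show what this step computes without depending on extra packages, the sketch below swaps in base R's prcomp() in place of FactoMineR's PCA() and factoextra's fviz_pca_ind(). The applicability check (at least two non-constant numeric columns) is an assumption about what "if applicable" means, and the iris dataset is used purely for illustration.

```r
# Numeric columns of a built-in dataset stand in for the user's CSV.
numeric_data <- iris[sapply(iris, is.numeric)]

# Drop constant columns, which cannot be scaled to unit variance.
keep <- sapply(numeric_data, function(x) sd(x, na.rm = TRUE) > 0)
numeric_data <- numeric_data[, keep, drop = FALSE]

# Assumed applicability check: PCA needs at least two numeric columns.
if (ncol(numeric_data) >= 2) {
  # Center and scale, so every variable contributes on the same footing.
  pca_fit <- prcomp(numeric_data, center = TRUE, scale. = TRUE)

  # Proportion of total variance explained by each component.
  var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
  print(round(var_explained, 3))

  # Scatter plot of the observations on the first two components.
  plot(pca_fit$x[, 1], pca_fit$x[, 2],
       xlab = "PC1", ylab = "PC2",
       main = "Observations on the first two principal components")
}
```

FactoMineR's PCA() also scales variables to unit variance by default, so the component scores should be broadly comparable, though the two functions return differently structured results.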

Conclusion

AutoInsight offers a robust framework for analyzing CSV data, providing various statistical and visual insights. It can be easily extended with additional features or customized according to user needs. This tool is ideal for anyone looking to automate their data analysis process and gain insights from their datasets quickly.

