# Example usage

This example shows how to use the `pyeda` package to validate the format of a data file and run basic exploratory data analysis. The steps are: check that the file is in CSV format, identify missing values, and generate summary statistics for the data.

## Imports

In [1]:
import csv
import pandas as pd
from pyeda.check_csv import check_csv
from pyeda.pymissing_values_summary import missing_values_summary
from pyeda.data_summary import get_summary_statistics

## Create an Example CSV File

To begin, we will create a sample CSV file to demonstrate the functionality of `pyeda`. This file will include some missing values to simulate a typical data scenario. You can find this `sample_data.csv` file [here](https://github.com/UBC-MDS/pyeda/blob/main/docs/sample_data.csv).

In [2]:
# Define file name
file_name = "sample_data.csv"

# Create data with some empty values
data = [
    ["Name", "Age", "City"],
    ["Alice", "25", "New York"],
    ["Bob", "", "Los Angeles"],  # Missing age
    ["Charlie", "30", ""],       # Missing city
    ["Emily", "22", "Chicago"],
]

# Write data to a CSV file
with open(file_name, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)

## Check if the Data File is in CSV Format

Before performing any analysis, validate that the given file is actually in CSV format. You can verify this with the `check_csv` function. The cell below raises a `TypeError` if the check fails, so the problem is reported before any analysis runs.

In [3]:
if not check_csv(file_name):
    raise TypeError("The given file is not in CSV format. Please check your data file.")
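
The rules that `check_csv` applies are not described here, so the following is only a rough, standard-library sketch of what a CSV format check could look like. The helper name `looks_like_csv` and its two rules (a `.csv` extension and the same number of fields in every row) are assumptions made for illustration, not `pyeda`'s implementation.

```python
import csv
import os
import tempfile


def looks_like_csv(path):
    """Hypothetical check: '.csv' extension and a consistent field count per row."""
    if not path.lower().endswith(".csv"):
        return False
    with open(path, newline="") as handle:
        # csv.reader yields an empty list for blank lines; skip those
        rows = [row for row in csv.reader(handle) if row]
    if not rows:
        return False
    width = len(rows[0])
    return all(len(row) == width for row in rows)


# Small demonstration using temporary files
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.csv")
    with open(good_path, "w", newline="") as handle:
        csv.writer(handle).writerows([["Name", "Age"], ["Alice", "25"], ["Bob", ""]])

    ragged_path = os.path.join(tmp, "ragged.csv")
    with open(ragged_path, "w", newline="") as handle:
        handle.write("a,b\n1,2,3\n")  # second row has an extra field

    good_result = looks_like_csv(good_path)
    ragged_result = looks_like_csv(ragged_path)
    text_result = looks_like_csv(os.path.join(tmp, "notes.txt"))

print(good_result, ragged_result, text_result)
```

Note that an empty field, such as Bob's missing age, still counts as a field, so a file with missing values passes this kind of structural check.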

## Check if the Data File Has Any Missing Values

Once the data file format has been verified, the next step is to check whether the dataset contains any missing values using `missing_values_summary`.

This function returns a summary that includes:

- The count of missing values for each column.
- The percentage of missing values for each column.

In [4]:
sample_df = pd.read_csv(file_name)

missing_values_summary(sample_df)

Age     1 (25.0%)
City    1 (25.0%)
Name: Missing Count (Percentage), dtype: object
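
To make the arithmetic behind these figures explicit, here is a minimal plain-Python sketch that recomputes the counts and percentages for the same sample data. It is not how `missing_values_summary` is implemented. The helper name `missing_summary` is hypothetical, and the sketch assumes that an empty field counts as missing (which is how `pd.read_csv` treats empty fields by default) and that only columns with at least one missing value are reported, as in the output above.

```python
import csv
import io

# Same content as sample_data.csv, inlined so the sketch is self-contained
SAMPLE = """Name,Age,City
Alice,25,New York
Bob,,Los Angeles
Charlie,30,
Emily,22,Chicago
"""


def missing_summary(rows):
    """Hypothetical helper: map column -> 'count (percent%)' for columns with gaps."""
    total = len(rows)
    summary = {}
    for column in rows[0].keys():
        missing = sum(1 for row in rows if row[column] == "")
        if missing:
            summary[column] = f"{missing} ({missing / total * 100:.1f}%)"
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = missing_summary(rows)
print(summary)  # {'Age': '1 (25.0%)', 'City': '1 (25.0%)'}
```

Each of the two columns has one empty field out of four rows, which gives 1 / 4 = 25.0%.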

## Get Data Summary

Finally, use the `get_summary_statistics` function to generate summary statistics for your dataset. You can pass specific columns to analyze through the `col` argument, or omit it to summarize all columns.

- **For numeric columns**: Key statistics, including mean, minimum, maximum, median, mode, and range.
- **For non-numeric columns**: Frequency-based metrics, including the number of unique values, the most frequent value, and its corresponding count.

In [5]:
get_summary_statistics(sample_df)

| | Name | Age | City |
|---|---|---|---|
| num_unique_values | 4 | | 3 |
| most_frequent_value | Alice | | New York |
| frequency_of_most_frequent_value | 1 | | 1 |
| mean | | 25.666667 | |
| min | | 22.0 | |
| max | | 30.0 | |
| median | | 25.0 | |
| mode | | 22.0 | |
| range | | 8.0 | |


In [6]:
get_summary_statistics(sample_df, col=["Age", "City"])

| | Age | City |
|---|---|---|
| mean | 25.666667 | |
| min | 22.0 | |
| max | 30.0 | |
| median | 25.0 | |
| mode | 22.0 | |
| range | 8.0 | |
| num_unique_values | | 3 |
| most_frequent_value | | New York |
| frequency_of_most_frequent_value | | 1 |
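
As a cross-check, the numeric statistics for `Age` can be reproduced by hand from the three non-missing values (25, 30, 22). The sketch below uses only Python's standard `statistics` module and is not `pyeda`'s implementation. One assumption is worth flagging: every age occurs exactly once, so the mode is a three-way tie, and the reported value of 22.0 is consistent with taking the smallest tied value. That tie-breaking rule is inferred from the output, not confirmed by the package.

```python
import statistics

# Non-missing Age values from the sample data (Bob's age is missing)
ages = [25, 30, 22]

stats = {
    "mean": statistics.mean(ages),
    "min": min(ages),
    "max": max(ages),
    "median": statistics.median(ages),
    # All values are tied; picking the smallest is an assumption
    # that happens to match the reported mode of 22.0
    "mode": min(statistics.multimode(ages)),
    "range": max(ages) - min(ages),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

The mean is 77 / 3 ≈ 25.666667 and the range is 30 − 22 = 8, matching the table. The missing age is excluded from the calculation, not treated as zero.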


## Conclusion

The `pyeda` package covers three common first steps of exploratory data analysis: validating that a data file is in CSV format, identifying missing values, and generating summary statistics. Handling these checks with a few function calls simplifies the preprocessing stage and leaves more time for deeper analysis. Try it out with your own dataset!