# Example usage

This example demonstrates how to use `pyeda` to verify the format of a data file and perform basic exploratory data analysis.

## Imports

In [1]:
import csv
import pandas as pd
from pyeda.check_csv import check_csv
from pyeda.pymissing_values_summary import missing_values_summary
from pyeda.data_summary import get_summary_statistics

## Create a CSV file

We'll first create a CSV file to work with. The same `sample_data.csv` file is available [here](https://github.com/UBC-MDS/pyeda/blob/main/docs/sample_data.csv).

In [2]:
# Define file name
file_name = "sample_data.csv"

# Create data with some empty values
data = [
    ["Name", "Age", "City"],
    ["Alice", "25", "New York"],
    ["Bob", "", "Los Angeles"],  # Missing age
    ["Charlie", "30", ""],       # Missing city
    ["Emily", "22", "Chicago"],      
]

# Write data to a CSV file
with open(file_name, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)

## Check if the data file is in CSV format

Before starting exploratory data analysis, verify that the given file is a CSV by calling the `check_csv` function.

In [3]:
if not check_csv(file_name):
    raise TypeError("The given file is not in CSV format. Please check your data file.")
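
For intuition, a format check of this kind can be approximated with the standard library alone. The sketch below is not `pyeda`'s implementation: `looks_like_csv` is a hypothetical helper, and it assumes that a CSV file has a `.csv` extension and that every row has the same number of fields.

```python
import csv


def looks_like_csv(path):
    """Hypothetical helper (not part of pyeda): rough CSV sanity check."""
    # Assumption: a CSV file carries a .csv extension.
    if not path.lower().endswith(".csv"):
        return False
    try:
        with open(path, newline="") as handle:
            rows = list(csv.reader(handle))
    except (OSError, UnicodeDecodeError, csv.Error):
        # Unreadable or unparsable files are treated as "not CSV".
        return False
    # Assumption: a non-empty file whose rows all have the same field count.
    return bool(rows) and len({len(row) for row in rows}) == 1
```

A check like this is only a heuristic; a stricter validator might also inspect the delimiter or the header row.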

## Check if data file has any missing values

After verifying the data file type, the next step is to check whether the data contains any missing values using `missing_values_summary`.

This function provides a quick summary that includes:

- The count of missing values for each column.
- The percentage of missing values for each column.

In [4]:
sample_df = pd.read_csv(file_name)

missing_values_summary(sample_df)

Age     1 (25.0%)
City    1 (25.0%)
Name: Missing Count (Percentage), dtype: object
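
The same counts and percentages can be cross-checked with plain pandas. This is a minimal sketch, not how `missing_values_summary` is implemented; the DataFrame is rebuilt inline so the block runs on its own, and the column names `missing_count` and `missing_pct` are chosen here for illustration.

```python
import pandas as pd

# Same contents as sample_data.csv; None stands in for the empty cells.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Emily"],
        "Age": [25, None, 30, 22],
        "City": ["New York", "Los Angeles", None, "Chicago"],
    }
)

# isna() marks missing cells; sum() counts them per column and
# mean() gives the missing fraction, scaled here to a percentage.
missing_count = df.isna().sum()
missing_pct = df.isna().mean() * 100

summary = pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})

# Keep only the columns that actually have missing values.
summary = summary[summary["missing_count"] > 0]
print(summary)
```

With one missing value out of four rows, both `Age` and `City` come out at 25%, matching the summary above.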

## Get data summary

Next, use the `get_summary_statistics` function to generate summary statistics for your dataset. You can pass specific columns to analyze; if no column names are provided, all columns are summarized.

- For numeric columns, the function calculates the mean, minimum, maximum, median, mode, and range.
- For non-numeric columns, it provides frequency-based metrics: the number of unique values, the most frequent value, and that value's count.

In [5]:
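# Summarize all columns
get_summary_statistics(sample_df)

# Summarize only the selected columns; as the last expression in the cell,
# this call produces the output shown below
get_summary_statistics(sample_df, col=["Age", "City"])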

| | Age | City |
|---|---|---|
| mean | 25.666667 | |
| min | 22.0 | |
| max | 30.0 | |
| median | 25.0 | |
| mode | 22.0 | |
| range | 8.0 | |
| num_unique_values | | 3 |
| most_frequent_value | | New York |
| frequency_of_most_frequent_value | | 1 |
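
To see where these numbers come from, the metrics can be recomputed with plain pandas. This is a sketch under the assumption that the statistics follow their usual pandas definitions; it is not `pyeda`'s implementation, and the data is rebuilt inline so the block runs on its own.

```python
import pandas as pd

# Age and City columns from sample_data.csv; None marks the missing cells.
age = pd.Series([25, None, 30, 22], name="Age", dtype="float64")
city = pd.Series(["New York", "Los Angeles", None, "Chicago"], name="City")

# pandas skips missing values by default in these reductions.
numeric_summary = {
    "mean": age.mean(),
    "min": age.min(),
    "max": age.max(),
    "median": age.median(),
    # mode() returns all tied values in sorted order; take the first.
    "mode": age.mode().iloc[0],
    "range": age.max() - age.min(),
}

categorical_summary = {
    "num_unique_values": city.nunique(),
    # Highest count among the observed (non-missing) values.
    "frequency_of_most_frequent_value": int(city.value_counts().max()),
}

print(numeric_summary)
print(categorical_summary)
```

Every `Age` value occurs exactly once, so the mode is a tie and the smallest value is reported here. Every `City` value also occurs once, so which city counts as the most frequent depends on how ties are broken.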
