# Exploring DataFrames

In this notebook, we will learn how to explore and understand the structure of a DataFrame. Exploration is a critical early step in data analysis: it familiarizes you with the dataset before you clean, transform, or model it.

## Topics Covered:
- Methods for initial exploration (`head()`, `info()`, `describe()`)
- Understanding data distributions
- Shape and size operations
- Practical: Exploring a real dataset

## Methods for Initial Exploration

### 1. Viewing the First Few Rows: `head()`
The `head()` method allows you to quickly view the first few rows of a DataFrame, providing a glimpse into its structure. By default, it shows the first 5 rows.

In [None]:
import pandas as pd

# Example DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
}
df = pd.DataFrame(data)

# Viewing the first few rows
print(df.head())
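
`head()` also accepts a row count, and `tail()` is its counterpart for the end of the DataFrame. The sketch below rebuilds the same toy data so it runs on its own; the variable names `first_two`, `last_two`, and `all_but_last` are illustrative only.

```python
import pandas as pd

# Same toy data as the example above, rebuilt so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

first_two = df.head(2)      # first 2 rows
last_two = df.tail(2)       # last 2 rows
all_but_last = df.head(-1)  # a negative n returns everything except the last |n| rows

print(first_two)
print(last_two)
print(all_but_last)
```

Checking both ends of a dataset is a cheap way to spot trailing summary rows or truncated files.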

### 2. Getting a Summary of the DataFrame: `info()`

The `info()` method provides a concise summary of the DataFrame, including:
- Column names and data types
- Non-null counts
- Memory usage

In [None]:
# Getting a summary of the DataFrame
# info() prints its report directly and returns None, so no print() is needed
df.info()
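
The non-null counts become more informative once a column has gaps. The sketch below uses a hypothetical three-row variant of the toy data (`df_missing`, with one age deliberately left out) and captures the report through the `buf` parameter of `info()`, since `info()` itself returns `None`.

```python
import io

import pandas as pd

# Hypothetical variant of the toy data with one missing age
df_missing = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, None, 35],
})

# info() writes to stdout by default; passing a buffer captures the text instead
buffer = io.StringIO()
df_missing.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# A per-column count of missing values complements the non-null counts
missing_per_column = df_missing.isna().sum()
print(missing_per_column)
```

Comparing each column's non-null count against the total number of entries shows at a glance which columns need cleaning.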

### 3. Descriptive Statistics: `describe()`

By default, the `describe()` method provides descriptive statistics for numerical columns only, such as:
- Count of non-null values
- Mean
- Standard deviation
- Minimum and maximum values
- Percentiles (25th, 50th, 75th)

This is a great way to quickly understand the distribution of your data.

In [None]:
# Descriptive statistics
print(df.describe())
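
Text columns such as `Name` and `City` are skipped by a plain `describe()` call. As a minimal sketch, passing `include="all"` asks pandas to summarize every column, adding rows like `unique`, `top`, and `freq` for non-numeric data; the variable names here are illustrative.

```python
import pandas as pd

# Same toy data as the earlier example, rebuilt so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

numeric_summary = df.describe()           # numerical columns only
full_summary = df.describe(include="all")  # every column, NaN where a statistic does not apply

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example, the mean of `City`) appear as `NaN`.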

## Understanding Data Distributions

Data distribution refers to the way data points are spread across a range of values. Understanding the distribution helps identify patterns, anomalies, and outliers.

### Useful Methods:
- `value_counts()` for categorical data
- Histograms and boxplots for numerical data
- Visualization using libraries like Matplotlib or Seaborn

In [None]:
# Distribution of categorical data
print(df['City'].value_counts())

# Importing Matplotlib for visualizations
import matplotlib.pyplot as plt

# Plotting a histogram for numerical data
df['Age'].plot(kind='hist', bins=5, title='Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
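
Distributions can also be checked numerically, without a plot. The sketch below uses a hypothetical sample (`people`, with made-up values, one repeated city, and one unusually high age) to show relative frequencies via `value_counts(normalize=True)` and the 1.5 × IQR rule that boxplots conventionally use to flag outliers.

```python
import pandas as pd

# Hypothetical sample with made-up values, for illustration only
people = pd.DataFrame({
    "City": ["Chicago", "Chicago", "Houston", "Phoenix", "Chicago", "Houston"],
    "Age": [25, 30, 35, 40, 45, 95],
})

# Share of rows per category instead of raw counts
city_share = people["City"].value_counts(normalize=True)
print(city_share)

# Interquartile range (IQR) and the usual 1.5 * IQR outlier fences
q1 = people["Age"].quantile(0.25)
q3 = people["Age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = people[(people["Age"] < lower) | (people["Age"] > upper)]
print(outliers)
```

The 1.5 multiplier is a convention rather than a rule; a flagged value is a prompt to investigate, not proof of an error.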

## Shape and Size Operations

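Knowing the dimensions of a DataFrame tells you how much data you are working with, which matters when a dataset is large enough to strain memory or slow down processing.

### Attributes and Functions:
- `shape`: An attribute that returns a tuple of `(rows, columns)`.
- `len()`: A built-in function; `len(df)` returns the number of rows.
- `size`: An attribute that returns the total number of elements (rows × columns).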

In [None]:
# Shape of the DataFrame
print('Shape:', df.shape)

# Number of rows
print('Number of rows:', len(df))

# Total number of elements
print('Total elements:', df.size)
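
These three values are related: `size` equals rows times columns, and `len(df)` equals the first entry of `shape`. A minimal sketch on the same toy data, with `memory_usage(deep=True)` added as one way to estimate the footprint in bytes:

```python
import pandas as pd

# Same toy data as the earlier example, rebuilt so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

rows, cols = df.shape  # shape is a tuple, so it can be unpacked

print("Rows:", rows)
print("Columns:", cols)
print("rows * cols == size:", rows * cols == df.size)
print("len(df) == rows:", len(df) == rows)

# Approximate memory footprint in bytes, including the contents of text columns
memory_bytes = int(df.memory_usage(deep=True).sum())
print("Memory (bytes):", memory_bytes)
```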

## Practical: Exploring a Real Dataset

Let’s apply these methods to a real-world dataset. We will use the COVID-19 dataset from Indonesia to explore its structure and distribution.

### Steps:
1. Load the dataset.
2. Perform initial exploration using `head()`, `info()`, and `describe()`.
3. Analyze the distribution of key columns.

In [None]:
# Load the COVID-19 dataset
covid_data = pd.read_csv('../DataSets/Data_COVID19_Indonesia.csv')

# Initial exploration
print(covid_data.head())
covid_data.info()  # info() prints its own report and returns None
print(covid_data.describe())

# Analyze the distribution of total cases
covid_data['Total Cases'].plot(kind='hist', bins=10, title='Total Cases Distribution')
plt.xlabel('Total Cases')
plt.ylabel('Frequency')
plt.show()
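
The steps above can be bundled into a small reusable function. This is only a sketch: `quick_explore` is a hypothetical helper, not part of pandas, and the `sample` frame holds made-up stand-in values rather than rows from the COVID-19 dataset. It checks that the requested column exists before summarizing it, which guards against a mistyped column name.

```python
import pandas as pd


def quick_explore(frame, column):
    """Hypothetical helper: collect basic facts about one column of a DataFrame."""
    if column not in frame.columns:
        raise KeyError(
            f"Column {column!r} not found; available: {list(frame.columns)}"
        )
    return {
        "shape": frame.shape,                        # (rows, columns)
        "missing": int(frame[column].isna().sum()),  # missing values in the column
        "summary": frame[column].describe(),         # count, mean, std, min, quartiles, max
    }


# Made-up stand-in values, NOT taken from the COVID-19 dataset
sample = pd.DataFrame({"Total Cases": [10, 20, None, 40]})

report = quick_explore(sample, "Total Cases")
print(report["shape"])
print(report["missing"])
print(report["summary"])
```

With the real file loaded, the same call would be `quick_explore(covid_data, 'Total Cases')`, assuming that column name matches the dataset's header.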