# Loading CSV Files with Pandas

This notebook demonstrates how to use the Pandas library to load and explore CSV files. CSV (Comma-Separated Values) files are one of the most common data formats for storing structured data.

## Learning Objectives
- Import the Pandas library
- Load CSV files into DataFrames
- Understand DataFrame structure and properties
- Preview data with the `head()` and `tail()` methods
- Inspect data types and missing values
- Handle common data loading scenarios

## Prerequisites
- Basic understanding of Python
- Pandas library installed

## Step 1: Import Required Libraries

First, we need to import the Pandas library, which provides powerful data manipulation and analysis tools.

In [1]:
# Step 1: Import the pandas library
import pandas as pd

print("Pandas library imported successfully!")
print(f"Pandas version: {pd.__version__}")

Pandas library imported successfully!
Pandas version: 2.3.1


## Step 2: Load CSV File into DataFrame

A DataFrame is Pandas' primary data structure for storing and manipulating tabular data. We'll load our CSV file into a DataFrame for analysis.

**Note:** Make sure the CSV file exists at the specified path. You may need to adjust the file path according to your directory structure.

In [2]:
# Step 2: Load the CSV file into a DataFrame
# Replace './data/data.csv' with the path to your actual CSV file
try:
    df = pd.read_csv("./data/data.csv")
    print("CSV file loaded successfully!")
except FileNotFoundError:
    print("CSV file not found. Please check the file path.")
    # Create a sample DataFrame for demonstration
    df = pd.DataFrame({
        'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
        'Age': [25, 30, 35, 28, 32],
        'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
        'Salary': [50000, 60000, 55000, 48000, 62000]
    })
    print("Using sample data instead.")

CSV file loaded successfully!


## Step 3: Understanding DataFrame Shape

The shape of a DataFrame tells us the number of rows and columns in our dataset. `df.shape` returns this as a `(rows, columns)` tuple, so `df.shape[0]` is the row count and `df.shape[1]` is the column count.

In [3]:
# Step 3: Understanding the shape of your DataFrame
print(f"DataFrame shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

DataFrame shape: (1000, 4)
Number of rows: 1000
Number of columns: 4


## Step 4: Preview the Data

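Previewing the data helps us understand what our dataset looks like without printing the entire DataFrame. The data is already fully loaded at this point; `head()` and `tail()` only limit how many rows are displayed, which is especially useful for large datasets.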

In [4]:
# Step 4: Preview the first and last few rows of the DataFrame
print("Preview of the first 5 rows of the DataFrame:")
print(df.head())  # Displays the first 5 rows

print("\nPreview of the last 10 rows of the DataFrame:")
print(df.tail(10))  # Displays the last 10 rows

Preview of the first 5 rows of the DataFrame:
    Name   Age       City Grade
0   Sara  50.0  San Diego     F
1   Emma  57.0  San Diego     C
2  David  29.0   New York     D
3  Chris  53.0   San Jose     D
4   Sara  21.0    Phoenix     A

Preview of the last 10 rows of the DataFrame:
      Name   Age         City Grade
990  David  38.0  San Antonio     C
991   Mike  39.0     New York     F
992   Mike  45.0      Houston     A
993  David  26.0  Los Angeles   NaN
994   Sara  43.0  Los Angeles     D
995  James  26.0  Los Angeles     F
996  James   NaN      Houston     C
997  David  27.0     New York   NaN
998  Alice  41.0      Houston     C
999  Laura  59.0  Los Angeles     B


## Step 5: Inspect DataFrame Structure

The `info()` method provides a comprehensive overview of the DataFrame, including column names, data types, non-null counts, and memory usage.

In [5]:
# Step 5: Inspect the structure of the DataFrame
print("DataFrame structure and data types:")
df.info()  # Prints the summary itself and returns None, so no print() is needed

DataFrame structure and data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    1000 non-null   object 
 1   Age     996 non-null    float64
 2   City    994 non-null    object 
 3   Grade   992 non-null    object 
dtypes: float64(1), object(3)
memory usage: 31.4+ KB
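
Because `info()` only prints its summary, it can help to know how to get the same facts as values you can work with. The following is a minimal sketch on a small, made-up DataFrame (the `sample` data is an assumption, not the notebook's CSV) using the `dtypes` attribute, `count()`, and `memory_usage()`:

```python
import pandas as pd

# Hypothetical sample that reuses the notebook's column names
sample = pd.DataFrame({
    "Name": ["Sara", "Emma", "David"],
    "Age": [50.0, None, 29.0],
    "City": ["San Diego", "San Diego", None],
    "Grade": ["F", "C", "D"],
})

dtypes = sample.dtypes        # Series: column name -> data type
non_null = sample.count()     # Series: column name -> number of non-null values
# deep=True also measures the contents of string columns, not just pointers
memory_bytes = sample.memory_usage(deep=True).sum()

print(dtypes)
print(non_null)
print(f"Approximate memory usage: {memory_bytes} bytes")
```

Unlike `info()`, each of these returns an object, so the results can be stored, compared, or used in later cells.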


## Step 6: Handle Missing Values

Missing values are common in real-world datasets. Before deciding how to handle them, count how many each column contains.

In [6]:
# Step 6: Count the missing (NaN) values in each column
print("Checking for missing values:")
missing_values = df.isnull().sum()
print(missing_values)  # Displays the number of missing values per column

# Calculate percentage of missing values
print("\nPercentage of missing values:")
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(missing_percentage)

Checking for missing values:
Name     0
Age      4
City     6
Grade    8
dtype: int64

Percentage of missing values:
Name     0.0
Age      0.4
City     0.6
Grade    0.8
dtype: float64
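
Once the gaps are counted, two common ways to handle them are dropping incomplete rows and filling in defaults. The sketch below uses a small, made-up DataFrame (`sample` is an assumption for illustration); filling `Age` with the median and `Grade` with `"Unknown"` is only one possible choice, and the right strategy depends on your data.

```python
import pandas as pd

# Hypothetical sample with one missing Age and one missing Grade
sample = pd.DataFrame({
    "Name": ["Sara", "James", "David"],
    "Age": [43.0, None, 27.0],
    "Grade": ["D", "C", None],
})

# Option 1: drop every row that has at least one missing value
dropped = sample.dropna()

# Option 2: fill each column with its own default value
filled = sample.fillna({
    "Age": sample["Age"].median(),  # median of the non-missing ages
    "Grade": "Unknown",
})

print(dropped)
print(filled)
```

Both calls return new DataFrames and leave `sample` unchanged, so the two results can be compared before committing to one approach.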


## Advanced Loading Techniques

Here are some additional techniques for loading CSV files in different scenarios.

In [7]:
# Advanced loading techniques (commented out - uncomment to try)
# Adjust './data/data.csv' to match the path used in Step 2

# 1. Loading only a portion of a large file (e.g., first 100 rows)
# df_sample = pd.read_csv("./data/data.csv", nrows=100)
# print(f"Sample DataFrame shape: {df_sample.shape}")

# 2. Loading with custom column names
# header=0 replaces the file's own header row; without it, that row is read as data
# custom_columns = ['col1', 'col2', 'col3', 'col4']
# df_custom = pd.read_csv("./data/data.csv", names=custom_columns, header=0)

# 3. Loading with specific data types
# dtype_dict = {'column_name': 'str', 'numeric_column': 'float64'}
# df_typed = pd.read_csv("./data/data.csv", dtype=dtype_dict)

# 4. Loading from different directory
# df_remote = pd.read_csv("/path/to/your/data.csv")

# 5. Loading with different separators (for a file that uses ';' as its delimiter)
# df_semicolon = pd.read_csv("data.csv", sep=';')

print("Advanced loading techniques are commented out.")
print("Uncomment the lines above to try different loading scenarios.")

Advanced loading techniques are commented out.
Uncomment the lines above to try different loading scenarios.
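
To try these parameters without needing a file on disk, the sketch below reads CSV text from memory with `io.StringIO`. The `csv_text` contents and the lowercase column names are made up for illustration; `nrows`, `names`, `header`, `dtype`, and `sep` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical CSV content with a header row and three data rows
csv_text = (
    "Name,Age,City,Grade\n"
    "Sara,50,San Diego,F\n"
    "Emma,57,San Diego,C\n"
    "David,29,New York,D\n"
)

# 1. Read only the first two data rows
df_sample = pd.read_csv(io.StringIO(csv_text), nrows=2)

# 2. Replace the file's column names; header=0 discards the original header row
df_custom = pd.read_csv(
    io.StringIO(csv_text),
    names=["name", "age", "city", "grade"],
    header=0,
)

# 3. Force Age to be read as float64 instead of the inferred integer type
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"Age": "float64"})

# 5. Read the same data when it is delimited by semicolons
semicolon_text = csv_text.replace(",", ";")
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(df_sample.shape)
print(list(df_custom.columns))
print(df_typed["Age"].dtype)
print(df_semicolon.shape)
```

The same keyword arguments work unchanged when the first argument is a file path instead of a `StringIO` object.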


## Summary

In this notebook, we learned how to:

1. **Import Pandas**: Essential library for data manipulation
2. **Load CSV files**: Using `pd.read_csv()` function
3. **Understand data shape**: Number of rows and columns
4. **Preview data**: Using `head()` and `tail()` methods
5. **Inspect structure**: Using `info()` method for comprehensive overview
6. **Check for missing values**: Using `isnull().sum()` to identify data quality issues
7. **Advanced loading techniques**: Various parameters for different scenarios

## Best Practices

- Always check the file path before loading
- Preview your data to understand its structure
- Check for missing values and handle them appropriately
- Use the `nrows` parameter to load a sample of a large file first
- Specify data types with `dtype` when the inferred types are wrong or use more memory than needed
- Handle file loading errors gracefully
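
The path-checking and error-handling practices above can be combined in a small helper. This is a minimal sketch, not a definitive pattern: `load_csv_or_none` is a hypothetical name, and the file names and temporary directory are used only for demonstration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_or_none(path):
    """Hypothetical helper: return a DataFrame, or None if loading fails."""
    csv_path = Path(path)
    if not csv_path.is_file():
        print(f"File not found: {csv_path}")
        return None
    try:
        return pd.read_csv(csv_path)
    except pd.errors.EmptyDataError:
        # Raised when the file contains no columns to parse
        print(f"File is empty: {csv_path}")
        return None
    except pd.errors.ParserError as exc:
        # Raised when the file cannot be parsed as CSV
        print(f"Could not parse {csv_path}: {exc}")
        return None


with tempfile.TemporaryDirectory() as tmp:
    # A path that does not exist
    missing = load_csv_or_none(Path(tmp) / "no_such_file.csv")

    # A valid CSV file
    good_path = Path(tmp) / "people.csv"
    good_path.write_text("Name,Age\nAlice,25\n")
    loaded = load_csv_or_none(good_path)

    # An empty file
    empty_path = Path(tmp) / "empty.csv"
    empty_path.write_text("")
    empty = load_csv_or_none(empty_path)
```

Returning `None` keeps the example short; in a larger project you might prefer to re-raise the exception or log it so that failures are not silently ignored.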

## Next Steps

- Learn about data cleaning and preprocessing
- Explore data visualization with Pandas and Matplotlib
- Practice with different CSV files and formats
- Learn about other data formats (Excel, JSON, etc.)