# Session 13: Introduction to Pandas and Reading Data

Welcome to the world of **Pandas**! This library is the cornerstone of data analysis in Python. In this session, we'll learn what Pandas is, how to create and explore DataFrames, and how to read data from files.

## Learning Objectives

By the end of this session, you will be able to:
- Explain what Pandas is and why it's essential for data analysis
- Create Series and DataFrames from Python data structures
- Inspect DataFrames using basic properties and methods
- Read data from CSV and JSON files
- Save DataFrames to files

## 1. What is Pandas and Why Use It?

**Pandas** is a powerful Python library for data manipulation and analysis. The name comes from "Panel Data" (an econometrics term) and "Python Data Analysis".

### Why Pandas?

- **Efficient data structures**: DataFrames and Series make working with tabular data intuitive
- **Data cleaning**: Built-in tools for handling missing values, duplicates, and data type conversions
- **Data manipulation**: Easy filtering, grouping, merging, and reshaping
- **File I/O**: Read and write data from various formats (CSV, Excel, JSON, SQL, etc.)
- **Integration**: Works seamlessly with NumPy, Matplotlib, and other data science libraries

### Installing and Importing Pandas

Pandas is typically installed via pip: `pip install pandas`

The convention is to import it with the alias `pd`:

In [None]:
import pandas as pd

# Check the version
print(f"Pandas version: {pd.__version__}")

## 2. Series: The Building Block

A **Series** is a one-dimensional labeled array. Think of it as a single column of data with an index.

### Creating a Series

In [None]:
# Creating a Series from a list
grades = pd.Series([85, 90, 78, 92, 88])
print(grades)
print(f"\nType: {type(grades)}")

In [None]:
# Creating a Series with custom index
grades = pd.Series(
    [85, 90, 78, 92, 88],
    index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
)
print(grades)

In [None]:
# Creating a Series from a dictionary
population = pd.Series({
    'Madrid': 3_223_000,
    'Barcelona': 1_620_000,
    'Valencia': 791_000,
    'Seville': 688_000
})
print(population)

In [None]:
# Accessing elements in a Series
print(f"Bob's grade: {grades['Bob']}")
# Use .iloc for positional access; grades[0] on a label-indexed Series is deprecated
print(f"First grade: {grades.iloc[0]}")
print(f"\nGrades above 85:")
print(grades[grades > 85])
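
The difference between label-based and position-based access is easy to mix up, so here is a minimal sketch of the two explicit accessors, `.loc` (labels) and `.iloc` (positions). It rebuilds the same `grades` Series so the cell runs on its own; the variable names for the results are only illustrative.

```python
import pandas as pd

grades = pd.Series(
    [85, 90, 78, 92, 88],
    index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
)

# .loc selects by index label
bob_grade = grades.loc['Bob']

# .iloc selects by integer position (0 is the first element, -1 the last)
first_grade = grades.iloc[0]
last_grade = grades.iloc[-1]

# Label slices include both endpoints; position slices exclude the end
by_label = grades.loc['Alice':'Charlie']   # Alice, Bob, Charlie
by_position = grades.iloc[0:2]             # Alice, Bob

print(bob_grade, first_grade, last_grade)
print(by_label)
print(by_position)
```

Being explicit about `.loc` versus `.iloc` avoids ambiguity when the index itself contains integers.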

### Series Attributes and Methods

In [None]:
print(f"Values: {grades.values}")
print(f"Index: {grades.index.tolist()}")
print(f"Data type: {grades.dtype}")
print(f"Size: {grades.size}")
print(f"Mean: {grades.mean():.2f}")

## 3. DataFrames: The Star of the Show

A **DataFrame** is a two-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.

### Creating DataFrames from Dictionaries

In [None]:
# Creating a DataFrame from a dictionary of lists
# Each key becomes a column name, each list becomes the column values

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [22, 25, 23, 24, 22],
    'major': ['Economics', 'Computer Science', 'Economics', 'Finance', 'Marketing'],
    'gpa': [3.8, 3.5, 3.2, 3.9, 3.6]
})

students

In [None]:
# Creating a DataFrame from a list of dictionaries
# Each dictionary represents a row

products = pd.DataFrame([
    {'product': 'Laptop', 'price': 999.99, 'stock': 50},
    {'product': 'Mouse', 'price': 29.99, 'stock': 200},
    {'product': 'Keyboard', 'price': 79.99, 'stock': 150},
    {'product': 'Monitor', 'price': 299.99, 'stock': 75}
])

products

In [None]:
# Creating a DataFrame with a custom index
sales = pd.DataFrame(
    {
        'units_sold': [150, 200, 175, 225],
        'revenue': [15000, 20000, 17500, 22500]
    },
    index=['Q1', 'Q2', 'Q3', 'Q4']
)

sales

### Creating DataFrames from Lists

In [None]:
# From a list of lists (need to specify column names)
data = [
    ['Madrid', 'Spain', 3223000],
    ['Barcelona', 'Spain', 1620000],
    ['Lisbon', 'Portugal', 545000],
    ['Porto', 'Portugal', 238000]
]

cities = pd.DataFrame(data, columns=['city', 'country', 'population'])
cities

## 4. Basic DataFrame Properties

Once you have a DataFrame, you'll want to understand its structure. Here are the essential properties:

In [None]:
# Let's create a larger dataset for demonstration
employees = pd.DataFrame({
    'employee_id': [101, 102, 103, 104, 105, 106, 107, 108],
    'name': ['Ana Garcia', 'Carlos Lopez', 'Maria Santos', 'Juan Martinez', 
             'Laura Fernandez', 'Pedro Gonzalez', 'Sofia Rodriguez', 'Diego Hernandez'],
    'department': ['Sales', 'IT', 'HR', 'Sales', 'IT', 'Marketing', 'HR', 'Sales'],
    'salary': [45000, 55000, 48000, 52000, 60000, 47000, 46000, 49000],
    'years_experience': [3, 5, 4, 6, 7, 2, 3, 4],
    'is_manager': [False, True, False, True, True, False, False, False]
})

employees

In [None]:
# Shape: (rows, columns)
print(f"Shape: {employees.shape}")
print(f"Number of rows: {employees.shape[0]}")
print(f"Number of columns: {employees.shape[1]}")

In [None]:
# Column names
print("Columns:")
print(employees.columns)
print(f"\nAs a list: {employees.columns.tolist()}")

In [None]:
# Data types of each column
print("Data types:")
print(employees.dtypes)
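
Each column carries its own data type, which pandas infers from the values. The small sketch below uses a made-up DataFrame called `mixed` (not part of the session's datasets) to show one column of each basic kind and how to test a column's type with the helpers in `pd.api.types`.

```python
import pandas as pd

# Hypothetical example data: one column per basic type
mixed = pd.DataFrame({
    'count': [1, 2, 3],              # integers
    'ratio': [0.5, 1.5, 2.0],        # floats
    'flag': [True, False, True],     # booleans
    'label': ['a', 'b', 'c'],        # text
})

print(mixed.dtypes)

# Type checks that do not depend on the exact dtype name
print(pd.api.types.is_integer_dtype(mixed['count']))
print(pd.api.types.is_float_dtype(mixed['ratio']))
print(pd.api.types.is_bool_dtype(mixed['flag']))
```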

In [None]:
# The info() method: comprehensive summary
print("DataFrame Info:")
employees.info()

In [None]:
# Index
print(f"Index: {employees.index}")
print(f"Index as list: {employees.index.tolist()}")

## 5. Viewing Data: head(), tail(), sample()

With large datasets, you don't want to display everything. These methods let you peek at the data:

In [None]:
# head() - first n rows (default 5)
print("First 3 rows:")
employees.head(3)

In [None]:
# tail() - last n rows (default 5)
print("Last 3 rows:")
employees.tail(3)

In [None]:
# sample() - random rows
print("3 random rows:")
employees.sample(3)

In [None]:
# sample() with random_state for reproducibility
print("3 random rows (reproducible):")
employees.sample(3, random_state=42)
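
`sample()` can also draw a fraction of the rows instead of a fixed count. A small sketch, using a throwaway DataFrame named `numbers` so the cell is self-contained:

```python
import pandas as pd

numbers = pd.DataFrame({'value': range(10)})

# frac draws a fraction of the rows instead of a fixed number
half = numbers.sample(frac=0.5, random_state=42)

# The same random_state selects the same rows on every run
half_again = numbers.sample(frac=0.5, random_state=42)

print(half)
print(half.equals(half_again))
```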

## 6. Basic Statistics with describe()

The `describe()` method provides summary statistics for numerical columns:

In [None]:
# Summary statistics for numerical columns
employees.describe()

In [None]:
# Include all columns (including non-numeric)
employees.describe(include='all')
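
To see where the numbers in `describe()` come from, the sketch below compares a few of its entries with the individual methods on a single column. It reuses the salary values from the `employees` table; `salaries` and `summary` are just illustrative names.

```python
import pandas as pd

salaries = pd.Series(
    [45000, 55000, 48000, 52000, 60000, 47000, 46000, 49000],
    name='salary'
)

# describe() on a Series returns a Series indexed by statistic name
summary = salaries.describe()
print(summary)

# Each entry matches the corresponding individual method
print(summary['mean'], salaries.mean())
print(summary['50%'], salaries.median())   # the 50% quantile is the median
print(summary['std'], salaries.std())      # sample standard deviation (ddof=1)
```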

## 7. Reading Data from Files

Real-world data usually comes from files. Pandas makes reading data incredibly easy.

### Reading CSV Files

CSV (Comma-Separated Values) is one of the most common formats for tabular data.

In [None]:
# First, let's create a sample CSV file to work with
sample_data = pd.DataFrame({
    'date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19'],
    'product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'],
    'quantity': [10, 5, 8, 12, 7],
    'unit_price': [25.99, 15.50, 25.99, 35.00, 15.50],
    'customer': ['Acme Corp', 'TechStart', 'Acme Corp', 'BigRetail', 'TechStart']
})

# Save to CSV
sample_data.to_csv('sales_data.csv', index=False)
print("Sample CSV file created!")

In [None]:
# Reading a CSV file
df = pd.read_csv('sales_data.csv')
df

In [None]:
# Common read_csv parameters

# sep: specify delimiter (default is comma)
# df = pd.read_csv('file.csv', sep=';')  # for semicolon-separated

# header: row number to use as column names (default 0)
# df = pd.read_csv('file.csv', header=None)  # no header row

# names: provide column names
# df = pd.read_csv('file.csv', names=['col1', 'col2', 'col3'])

# usecols: read only specific columns
df_subset = pd.read_csv('sales_data.csv', usecols=['date', 'product', 'quantity'])
print("Reading only specific columns:")
df_subset

In [None]:
# nrows: read only first n rows (useful for large files)
df_preview = pd.read_csv('sales_data.csv', nrows=3)
print("Reading only first 3 rows:")
df_preview

In [None]:
# index_col: use a column as the index
df_indexed = pd.read_csv('sales_data.csv', index_col='date')
print("Using 'date' as index:")
df_indexed
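
By default, `read_csv` reads date columns as plain text. One more parameter worth knowing is `parse_dates`, sketched below. To keep the cell independent of any file on disk, it reads from an in-memory string via `io.StringIO`; the three rows are copied from the sample sales data.

```python
import io
import pandas as pd

csv_text = """date,product,quantity
2024-01-15,Widget A,10
2024-01-16,Widget B,5
2024-01-17,Widget A,8
"""

# Without parse_dates, the 'date' column is read as text
as_text = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the listed columns to datetime values
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(as_text['date'].dtype)
print(as_dates['date'].dtype)

# Datetime columns expose date logic through the .dt accessor
print(as_dates['date'].dt.day_name().tolist())
```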

### Reading JSON Files

JSON is common for web APIs and modern data exchange.

In [None]:
# Create a sample JSON file
import json

json_data = [
    {"name": "Python for Data Analysis", "author": "Wes McKinney", "year": 2022, "price": 49.99},
    {"name": "Hands-On Machine Learning", "author": "Aurelien Geron", "year": 2022, "price": 59.99},
    {"name": "Deep Learning", "author": "Ian Goodfellow", "year": 2016, "price": 72.00},
    {"name": "The Pragmatic Programmer", "author": "David Thomas", "year": 2019, "price": 49.99}
]

with open('books.json', 'w') as f:
    json.dump(json_data, f, indent=2)
    
print("Sample JSON file created!")

In [None]:
# Reading JSON
books = pd.read_json('books.json')
books

In [None]:
# JSON can have different orientations
# This JSON is column-oriented: {column -> {index -> value}}
records_json = '{"name":{"0":"Alice","1":"Bob"},"age":{"0":25,"1":30}}'

with open('records.json', 'w') as f:
    f.write(records_json)

# Tell read_json which orientation to expect
# ('columns' is also the default when reading into a DataFrame)
df_records = pd.read_json('records.json', orient='columns')
df_records
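
To make the idea of orientations concrete, this sketch serializes the same small DataFrame in two layouts, `'columns'` and `'records'`, and reads each one back. It works on in-memory strings wrapped in `io.StringIO` rather than files; the `people` DataFrame is an illustrative stand-in.

```python
import io
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# orient='columns': {column -> {index -> value}}
columns_json = people.to_json(orient='columns')

# orient='records': a list with one object per row
records_json = people.to_json(orient='records')

print(columns_json)
print(records_json)

# Wrap JSON strings in StringIO so read_json treats them as file-like input
from_columns = pd.read_json(io.StringIO(columns_json), orient='columns')
from_records = pd.read_json(io.StringIO(records_json), orient='records')

print(from_columns)
print(from_records)
```

The `'records'` layout is what most web APIs return, while `'columns'` keeps the index labels in the output.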

## 8. Handling File Issues

Real-world files often have issues. Here's how to handle common problems:

In [None]:
# Create a problematic CSV file
problematic_csv = """name;age;salary
Alice;25;50000
Bob;30;N/A
Charlie;-;55000
Diana;28;52000"""

with open('problematic.csv', 'w') as f:
    f.write(problematic_csv)
    
print("Problematic CSV created!")

In [None]:
# Handle different delimiter
df = pd.read_csv('problematic.csv', sep=';')
df

In [None]:
# Handle custom missing values
# ('N/A', 'NA', and 'null' are already recognized by default; '-' is not)
df = pd.read_csv('problematic.csv', sep=';', na_values=['N/A', '-', 'NA', 'null'])
df

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
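
Unrecognized missing-value markers matter because they change a column's data type. The sketch below reads the same problematic text twice from an in-memory string, once with the defaults and once with `na_values=['-']`, and compares the `age` column. The names `default_read` and `custom_read` are illustrative.

```python
import io
import pandas as pd

raw = """name;age;salary
Alice;25;50000
Bob;30;N/A
Charlie;-;55000
Diana;28;52000"""

default_read = pd.read_csv(io.StringIO(raw), sep=';')
custom_read = pd.read_csv(io.StringIO(raw), sep=';', na_values=['-'])

# 'N/A' is in pandas' default missing-value list, so salary is numeric either way
print(default_read['salary'].isna().sum())

# '-' is not, so without na_values the whole age column is read as text
print(default_read['age'].tolist())

# With na_values, '-' becomes NaN and the column is numeric
print(custom_read['age'].tolist())
print(custom_read.dtypes)
```

A numeric column that unexpectedly shows up as text in `dtypes` is often a sign of a stray placeholder like this.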

In [None]:
# Handling encoding issues (common with international characters)
# Create a file with special characters
special_chars = """ciudad,poblacion
Madrid,3223000
Málaga,571000
A Coruña,245000"""

with open('ciudades.csv', 'w', encoding='utf-8') as f:
    f.write(special_chars)

# Read with proper encoding
ciudades = pd.read_csv('ciudades.csv', encoding='utf-8')
ciudades
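
Encoding problems only show up when the file's encoding differs from the one used to read it. As a hedged illustration, the sketch below writes a temporary file in Latin-1 (an assumption standing in for a legacy export), shows that the default UTF-8 read fails, and then reads it with the matching encoding. The accented city names are chosen for the demo.

```python
import os
import tempfile
import pandas as pd

text = "ciudad,poblacion\nMálaga,571000\nA Coruña,245000\n"

# Write the file as Latin-1 to simulate a legacy export (assumption for this demo)
with tempfile.NamedTemporaryFile(
    'w', suffix='.csv', encoding='latin-1', delete=False
) as f:
    f.write(text)
    path = f.name

try:
    # read_csv assumes UTF-8 by default, which cannot decode the Latin-1 byte for 'á'
    try:
        pd.read_csv(path)
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

    # Passing the encoding the file was actually written with fixes the read
    ciudades_latin1 = pd.read_csv(path, encoding='latin-1')
finally:
    os.remove(path)

print(f"Default read failed: {decode_failed}")
print(ciudades_latin1)
```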

In [None]:
# Handling file not found errors
import os

filename = 'nonexistent.csv'

if os.path.exists(filename):
    df = pd.read_csv(filename)
else:
    print(f"File '{filename}' not found!")

In [None]:
# Using try/except for robust file reading
try:
    df = pd.read_csv('nonexistent.csv')
except FileNotFoundError:
    print("File not found! Please check the path.")
except pd.errors.EmptyDataError:
    print("File is empty!")
except Exception as e:
    print(f"An error occurred: {e}")
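
If the same error handling is needed in several places, it can be wrapped in a small helper. The function below, `load_csv_or_none`, is a hypothetical name for this sketch and not part of pandas; it returns `None` instead of raising when the file is missing or empty.

```python
import pandas as pd

def load_csv_or_none(path, **read_csv_kwargs):
    """Return a DataFrame, or None if the file is missing or empty."""
    try:
        # Extra keyword arguments (sep, encoding, ...) are passed to read_csv
        return pd.read_csv(path, **read_csv_kwargs)
    except FileNotFoundError:
        print(f"File '{path}' not found.")
    except pd.errors.EmptyDataError:
        print(f"File '{path}' is empty.")
    return None

df = load_csv_or_none('nonexistent.csv')
if df is None:
    # Decide explicitly what to do when loading fails
    print("Falling back to an empty DataFrame")
    df = pd.DataFrame()
```

Returning `None` makes the caller decide how to proceed, which is usually clearer than silently continuing with stale data.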

## 9. Saving DataFrames

After processing data, you'll often want to save the results.

In [None]:
# Create a sample DataFrame to save
results = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie'],
    'score': [95, 87, 92],
    'passed': [True, True, True]
})

results

In [None]:
# Save to CSV
results.to_csv('results.csv', index=False)  # index=False prevents saving the index
print("Saved to results.csv")

# Verify by reading it back
pd.read_csv('results.csv')
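
To see what `index=False` actually prevents, this sketch writes the same DataFrame both ways and reads each version back. When no path is given, `to_csv()` returns the CSV text as a string, so no files are created; the smaller `results` table here is a cut-down copy for illustration.

```python
import io
import pandas as pd

results = pd.DataFrame({'student': ['Alice', 'Bob'], 'score': [95, 87]})

# With no path, to_csv returns the CSV text instead of writing a file
with_index = results.to_csv()
without_index = results.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the first version back produces an extra 'Unnamed: 0' column
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```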

In [None]:
# Save to CSV with custom separator
results.to_csv('results_semicolon.csv', sep=';', index=False)
print("Saved with semicolon separator")

In [None]:
# Save to JSON
results.to_json('results.json', orient='records', indent=2)
print("Saved to results.json")

# Let's see what it looks like
with open('results.json', 'r') as f:
    print(f.read())

In [None]:
# Save to Excel (requires an Excel engine such as openpyxl: pip install openpyxl)
try:
    results.to_excel('results.xlsx', index=False, sheet_name='Scores')
    print("Saved to results.xlsx")
except ImportError:  # also covers ModuleNotFoundError, which is a subclass
    print("openpyxl not installed. Run: pip install openpyxl")

## 10. Accessing Columns

A quick preview of how to access individual columns (we'll cover this in depth next session):

In [None]:
# Access a single column (returns a Series)
print("Name column:")
print(employees['name'])
print(f"\nType: {type(employees['name'])}")

In [None]:
# Access using dot notation (only works when the column name is a valid Python
# identifier, e.g. no spaces, and does not clash with a DataFrame attribute or method)
print("Salary column:")
print(employees.salary)
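
The limits of dot notation are easiest to see with a counterexample. The `inventory` DataFrame below is made up for this sketch: one column name contains a space and another shares its name with the `DataFrame.count()` method.

```python
import pandas as pd

# Hypothetical data chosen to trigger the two common pitfalls
inventory = pd.DataFrame({
    'unit price': [9.99, 4.50],   # contains a space
    'count': [3, 10],             # same name as the DataFrame.count() method
    'sku': ['A1', 'B2'],
})

# Bracket notation works for any column name
print(inventory['unit price'].tolist())
print(inventory['count'].tolist())

# Dot notation works for simple names that do not clash with attributes
print(inventory.sku.tolist())

# inventory.count refers to the method, not the column
print(callable(inventory.count))
```

Bracket notation is the safer default; it is also the only form that works when creating a new column.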

In [None]:
# Access multiple columns (returns a DataFrame)
print("Name and salary columns:")
employees[['name', 'salary']]

In [None]:
# Create a new column
employees['bonus'] = employees['salary'] * 0.1
employees

## Summary

In this session, we covered:

1. **What Pandas is**: A powerful library for data manipulation and analysis
2. **Series**: One-dimensional labeled arrays
3. **DataFrames**: Two-dimensional labeled data structures (the main workhorse)
4. **Creating DataFrames**: From dictionaries and lists
5. **Basic properties**: `shape`, `columns`, `dtypes`, `info()`
6. **Viewing data**: `head()`, `tail()`, `sample()`
7. **Statistics**: `describe()`
8. **Reading files**: `read_csv()`, `read_json()`
9. **Handling issues**: Delimiters, missing values, encoding
10. **Saving data**: `to_csv()`, `to_json()`, `to_excel()`

### Key Points to Remember

- Import pandas as: `import pandas as pd`
- A DataFrame is like a spreadsheet: rows and columns
- Always check your data with `head()`, `info()`, and `describe()`
- `read_csv()` and `to_csv()` are your most common file operations
- Use `index=False` when saving to avoid extra index columns

### Next Session

We'll practice everything we learned today with hands-on exercises!

In [None]:
# Cleanup: remove the temporary files we created
import os

files_to_remove = [
    'sales_data.csv', 'books.json', 'records.json', 
    'problematic.csv', 'ciudades.csv', 'results.csv', 
    'results_semicolon.csv', 'results.json', 'results.xlsx'
]

for file in files_to_remove:
    if os.path.exists(file):
        os.remove(file)
        
print("Temporary files cleaned up!")