# Notebook 2: Working with CSV Data in Colab

**Session 2: The AI-Empowered Coder**  
*Generative AI for Scholarship — Harvard HDSI & FAS*

© 2026 President and Fellows of Harvard College. Licensed under CC BY-NC 4.0.

---

## What This Notebook Covers

This notebook demonstrates how to **create, save, read, and analyze CSV data**
within Google Colab's session storage.

1. **Generate** a synthetic dataset (2 columns, 100 rows)
2. **Write** it to a CSV file in Colab's working directory
3. **Read** the CSV file back into Python
4. **Analyze** the data with summary statistics and a plot
5. **Save** analysis results to a new CSV file

### Important: Colab Session Storage

Files created in Colab's working directory (`/content/`) are **temporary**.
They exist only for the duration of your Colab session. If the runtime
disconnects (due to inactivity or timeout), these files are deleted.

To keep files permanently, save them to **Google Drive** (see Notebook 3).
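
If you are unsure where your files are going, a quick check like the one below can help. This is a minimal sketch, not part of the original notebook: the helper name `describe_workdir` is hypothetical, and testing for `google.colab` in `sys.modules` is an informal convention for detecting a Colab runtime, not an official API guarantee.

```python
import sys
from pathlib import Path


def describe_workdir():
    """Return the current working directory and the CSV files in it."""
    cwd = Path.cwd()
    csv_files = sorted(p.name for p in cwd.glob("*.csv"))
    return cwd, csv_files


# Assumption: the 'google.colab' module is loaded only inside a Colab runtime.
in_colab = "google.colab" in sys.modules

cwd, csv_files = describe_workdir()
print(f"Running in Colab:  {in_colab}")
print(f"Working directory: {cwd}")
print(f"CSV files here:    {csv_files}")
```

In a fresh Colab session the working directory is expected to be `/content`, and the CSV list stays empty until you write a file.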

---

## Step 1: Import Libraries

We start with two libraries, both pre-installed in Colab (Matplotlib is imported later, when we plot):
- **NumPy** — for generating random numerical data
- **Pandas** — for working with tabular data (DataFrames) and CSV files

In [None]:
# ============================================================
# Import the libraries we need.
# Both are pre-installed in Google Colab.
# ============================================================

import numpy as np       # Numerical computing and random number generation
import pandas as pd      # DataFrames and CSV file I/O

print(f"NumPy version:  {np.__version__}")
print(f"Pandas version: {pd.__version__}")

---

## Step 2: Generate Synthetic Data

We'll create a dataset with **100 rows** and **2 columns**:

| Column | Description |
|--------|-------------|
| `day` | Day number (1 through 100) — represents consecutive days of observation |
| `temperature_C` | Daily average temperature in Celsius — a seasonal pattern with random noise |

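The temperature follows a **sinusoidal seasonal pattern** (warmer in summer,
cooler in winter) plus random day-to-day variation. One full cycle is
compressed into the 100 days, with a warm peak near day 25 and a cool trough
near day 75. This mimics what you might see in real weather data.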
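
To build intuition for the sinusoid before any noise is added, it can help to evaluate it at a few checkpoints. The sketch below assumes the same baseline (15 °C), amplitude (10 °C), and 100-day period used in this notebook; the function name `seasonal_temperature` is only illustrative.

```python
import numpy as np


def seasonal_temperature(day, baseline=15.0, amplitude=10.0, period=100):
    """Noise-free seasonal model: baseline + amplitude * sin(2*pi*day/period)."""
    return baseline + amplitude * np.sin(2 * np.pi * np.asarray(day) / period)


# Quarter points of the cycle: peak, midpoint, trough, end of cycle.
checkpoints = [25, 50, 75, 100]
values = seasonal_temperature(checkpoints)

for d, t in zip(checkpoints, values):
    print(f"day {d:3d}: {t:.1f} °C")
```

The peak (about 25 °C) and trough (about 5 °C) sit one amplitude above and below the baseline, which is the ±10 °C swing described above.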

We use `np.random.seed(42)` to make the random numbers **reproducible** —
everyone running this notebook gets the same data.
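
A short sketch of what "reproducible" means in practice: seeding with the same value replays the same sequence of random numbers, while a different seed gives different numbers. The helper `draw_noise` is hypothetical and exists only for this illustration.

```python
import numpy as np


def draw_noise(seed, size=5):
    """Reset the global NumPy seed, then draw `size` normal samples."""
    np.random.seed(seed)
    return np.random.normal(loc=0, scale=2, size=size)


first = draw_noise(42)
second = draw_noise(42)   # same seed: identical values
other = draw_noise(7)     # different seed: different values

print("Same seed, same values:     ", np.array_equal(first, second))
print("Different seed, same values:", np.array_equal(first, other))
```

NumPy's newer `np.random.default_rng(seed)` interface is an alternative worth knowing about, but it produces a different sequence than `np.random.seed`, so this notebook's data would change if you switched.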

In [None]:
# ============================================================
# Generate a synthetic dataset: 100 days of temperature data.
#
# The temperature model is:
#   T(day) = 15 + 10 * sin(2π * day / 100) + noise
#
# This gives a baseline of 15°C, a seasonal swing of ±10°C,
# and random daily variation (noise) with std dev of 2°C.
# ============================================================

# Set random seed for reproducibility
np.random.seed(42)

# Number of data points
n_rows = 100

# Column 1: Day number (1 through 100)
day = np.arange(1, n_rows + 1)

# Column 2: Temperature with a seasonal pattern + random noise
seasonal_pattern = 15 + 10 * np.sin(2 * np.pi * day / n_rows)
daily_noise = np.random.normal(loc=0, scale=2, size=n_rows)
temperature_C = seasonal_pattern + daily_noise

# Round temperatures to 1 decimal place (realistic precision)
temperature_C = np.round(temperature_C, 1)

# Assemble into a Pandas DataFrame
df = pd.DataFrame({
    'day': day,
    'temperature_C': temperature_C
})

# Display the first 10 rows to verify the data looks reasonable
print(f"Generated {len(df)} rows with {len(df.columns)} columns.")
print(f"\nFirst 10 rows:")
df.head(10)

---

## Step 3: Write the Data to a CSV File

CSV (Comma-Separated Values) is the most common format for sharing tabular data.
Pandas makes it simple to write a DataFrame to CSV with `df.to_csv()`.

The file will be saved in Colab's current working directory (`/content/`),
which you can browse using the **folder icon** in Colab's left sidebar.
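
To see exactly what `to_csv()` produces without writing anything to disk, you can call it with no file name, in which case pandas returns the CSV text as a string. The sketch below uses a tiny made-up DataFrame (the values are illustrative, not the notebook's data) to compare the output with and without the row index.

```python
import pandas as pd

# Tiny illustrative DataFrame (made-up values).
sample = pd.DataFrame({"day": [1, 2], "temperature_C": [16.5, 17.0]})

# With no path argument, to_csv() returns the CSV content as a string.
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

print("With index:")
print(with_index)
print("Without index:")
print(without_index)
```

With the index included, the header starts with an empty field and every row gains a leading `0`, `1`, ... column, which is usually not wanted when the data already has its own identifier.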

In [None]:
# ============================================================
# Write the DataFrame to a CSV file.
#
# index=False prevents pandas from writing the row index
# as an extra column (we don't need it since 'day' already
# serves as our identifier).
# ============================================================

import os  # Used below to confirm the file exists and report its size

output_file = 'daily_temperatures.csv'

df.to_csv(output_file, index=False)

print(f"Data written to: {output_file}")
if os.path.exists(output_file):
    print(f"File size: {os.path.getsize(output_file):,} bytes")
else:
    print("Error: file not created")

### Verify: Look at the Raw CSV File

Let's peek at the first few lines of the file to see what CSV format looks like.
`head` is a Linux shell command (Colab runs on Linux) that shows the first
N lines of a file; the `!` prefix tells Colab to run it in the shell.

In [None]:
# ============================================================
# Display the first 5 lines of the CSV file.
# The '!' prefix runs a shell command instead of Python.
# ============================================================

!head -5 daily_temperatures.csv

You should see something like:
```
day,temperature_C
1,16.6
2,16.0
3,18.2
4,20.5
```

The first line is the **header** (column names), and each subsequent line
is one row of data with values separated by commas.
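
The header-plus-rows structure can also be seen by parsing CSV text with Python's standard `csv` module. This is only a sketch with made-up values; in practice `pd.read_csv()` is more convenient, but the low-level view shows that a CSV file is plain text until something converts the fields.

```python
import csv
import io

# A small CSV document held in memory (made-up values).
raw_text = "day,temperature_C\n1,16.5\n2,17.0\n"

reader = csv.reader(io.StringIO(raw_text))
header = next(reader)            # first line: column names
rows = [row for row in reader]   # remaining lines: one list per row

print("Header:", header)
print("Rows:  ", rows)
```

Every field comes back as a string (`'16.5'`, not `16.5`); converting fields to numbers is one of the things pandas handles for you.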

---

## Step 4: Read the CSV File Back

Now let's read the file we just created, as if we were loading data
from an external source. This uses `pd.read_csv()`, which is the
most common way to load tabular data in Python.
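
A useful habit when saving and reloading data is a round-trip check: write a DataFrame, read it back, and confirm nothing changed. The sketch below does this in memory with `io.StringIO` and a small made-up DataFrame, so it does not touch the files created in this notebook.

```python
import io

import pandas as pd

# Small illustrative DataFrame (made-up values).
original = pd.DataFrame({
    "day": [1, 2, 3],
    "temperature_C": [16.5, 17.0, 18.2],
})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)  # rewind so read_csv starts at the beginning

restored = pd.read_csv(buffer)

# Raises AssertionError if columns, dtypes, or values differ
# (floats are compared within pandas' default tolerance).
pd.testing.assert_frame_equal(original, restored)
print(restored.dtypes)
```

If `index=False` were omitted when writing, the reloaded frame would gain an extra unnamed column and the comparison would fail.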

In [None]:
# ============================================================
# Read the CSV file into a new DataFrame.
#
# With default arguments, pd.read_csv():
#   - Treats the first row as the header (column names)
#   - Infers data types (integers, floats, strings)
#   - Assumes a comma delimiter (override with sep=...)
# ============================================================

df_loaded = pd.read_csv('daily_temperatures.csv')

# Confirm it loaded correctly
print(f"Loaded {len(df_loaded)} rows and {len(df_loaded.columns)} columns.")
print(f"Column names: {list(df_loaded.columns)}")
print(f"Data types:")
print(df_loaded.dtypes)
print(f"\nFirst 5 rows:")
df_loaded.head()

---

## Step 5: Analyze the Data

Pandas provides built-in methods for **summary statistics**.
The `.describe()` method gives you count, mean, standard deviation,
min, max, and quartiles in one call.
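
To make the `.describe()` output less of a black box, the sketch below recomputes a few of its entries by hand on a small made-up Series. One detail worth noticing: pandas reports the *sample* standard deviation (dividing by n - 1), which differs from the NumPy array default (dividing by n).

```python
import pandas as pd

# Five made-up temperatures, evenly spaced so the results are easy to check.
temps = pd.Series([12.0, 15.0, 18.0, 21.0, 24.0], name="temperature_C")

summary = temps.describe()

# Recompute selected entries individually.
manual_mean = temps.sum() / temps.count()
manual_median = temps.quantile(0.50)
sample_std = temps.std()                  # pandas default: ddof=1
population_std = temps.to_numpy().std()   # NumPy default: ddof=0

print(summary)
print(f"mean by hand:   {manual_mean:.3f}")
print(f"median (50%):   {manual_median:.3f}")
print(f"sample std:     {sample_std:.3f}")
print(f"population std: {population_std:.3f}")
```

The `50%` row of `.describe()` is the median, and the `std` row matches the sample standard deviation rather than the population one.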

In [None]:
# ============================================================
# Summary statistics for the temperature column.
#
# .describe() returns: count, mean, std, min, 25%, 50%, 75%, max
# ============================================================

print("Summary Statistics")
print("=" * 30)
print(df_loaded['temperature_C'].describe())

print(f"\nAdditional statistics:")
print(f"  Median:   {df_loaded['temperature_C'].median():.1f} °C")
print(f"  Range:    {df_loaded['temperature_C'].min():.1f} – {df_loaded['temperature_C'].max():.1f} °C")
print(f"  Variance: {df_loaded['temperature_C'].var():.2f} °C²")

### Plot the Data

A plot makes the seasonal pattern immediately visible.
We'll add a horizontal line at the mean temperature for reference.

In [None]:
# ============================================================
# Plot temperature vs. day to visualize the seasonal pattern.
# ============================================================

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))

# Plot the temperature data as a continuous line (no point markers)
ax.plot(df_loaded['day'], df_loaded['temperature_C'],
        color='#A51C30', linewidth=1, alpha=0.7, label='Daily temperature')

# Add a horizontal line at the mean
mean_temp = df_loaded['temperature_C'].mean()
ax.axhline(y=mean_temp, color='gray', linestyle='--', alpha=0.5,
           label=f'Mean ({mean_temp:.1f} °C)')

# Labels and formatting
ax.set_xlabel('Day')
ax.set_ylabel('Temperature (°C)')
ax.set_title('Daily Average Temperature — 100 Days of Synthetic Data')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Step 6: Save Analysis Results to a New CSV

Let's add a computed column to the data and save the enhanced version
to a new CSV file. We'll convert Celsius to Fahrenheit and include both.
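
Before applying a formula to a whole column, it is worth checking it on values whose answers you already know. The sketch below tests the conversion F = C × 9/5 + 32 against familiar reference points; the function name `celsius_to_fahrenheit` is just for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32


# Reference points: freezing, the notebook's baseline, boiling,
# and -40, where the two scales coincide.
for c in (0.0, 15.0, 100.0, -40.0):
    print(f"{c:6.1f} °C = {celsius_to_fahrenheit(c):6.1f} °F")
```

The same expression works unchanged on a pandas Series, because arithmetic operators are applied element by element.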

In [None]:
# ============================================================
# Add a Fahrenheit column and save the enhanced data.
#
# The conversion formula is: F = C × 9/5 + 32
# ============================================================

# Add the new column
df_loaded['temperature_F'] = np.round(df_loaded['temperature_C'] * 9/5 + 32, 1)

# Save to a new CSV file
output_file = 'daily_temperatures_enhanced.csv'
df_loaded.to_csv(output_file, index=False)

print(f"Enhanced data written to: {output_file}")
print(f"\nFirst 5 rows of enhanced data:")
df_loaded.head()

### Downloading Files from Colab

Since Colab session storage is temporary, you may want to **download**
your output files to your local machine. You can do this two ways:

1. **File browser**: Click the folder icon in the left sidebar, find the file, right-click → Download
2. **Code**: Use the cell below to trigger a download in your browser

In [None]:
# ============================================================
# Download a file from Colab to your local machine.
#
# This uses Colab's built-in files module to trigger a
# browser download. A download dialog will appear.
# ============================================================

from google.colab import files

files.download('daily_temperatures_enhanced.csv')

print("Download triggered — check your browser's download folder.")

---

## Try It: Use the Magic Wand

Now try using Colab's **magic wand** (cell-level AI) to extend this analysis.
Click on the empty code cell below, then click the magic wand icon and try
one of these prompts:

- **"Create a histogram of the temperature data with 15 bins"**
- **"Find the 5 warmest and 5 coldest days in the dataset"**
- **"Calculate a 7-day rolling average and plot it on top of the raw data"**

In [None]:
# Use the magic wand to generate analysis code here



---

## Summary

| Operation | Code | Notes |
|-----------|------|-------|
| Write CSV | `df.to_csv('file.csv', index=False)` | `index=False` avoids writing row numbers |
| Read CSV | `pd.read_csv('file.csv')` | Auto-detects headers and data types |
| Statistics | `df['col'].describe()` | Count, mean, std, min, quartiles, max |
| Plot | `plt.plot(x, y)` | Use matplotlib for quick visualization |
| Download | `files.download('file.csv')` | Triggers browser download from Colab |

**Remember:** Files in Colab's `/content/` directory are **temporary**.
Download them or save to Google Drive before your session ends.

---

**Next:** Open **Notebook 3** to learn how to read files from your Google Drive.