# BigQuery Basics - Getting Started

This notebook demonstrates how to connect to Google BigQuery and perform basic queries using the Python client library.

## Setup and Authentication

First, ensure you have authenticated with Google Cloud:
```bash
gcloud auth application-default login
```

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from google.cloud import bigquery
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plot style
# ('seaborn-v0_8' needs matplotlib >= 3.6; older releases name this style 'seaborn')
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

## Initialize BigQuery Client

In [None]:
# Initialize the BigQuery client
# Replace 'your-project-id' with your actual project ID
PROJECT_ID = 'your-project-id'  # TODO: Update this
client = bigquery.Client(project=PROJECT_ID)

print(f"Connected to project: {client.project}")

## Example 1: Basic Query

In [None]:
# Top 10 names by total count since 2000, from a BigQuery public dataset
query = """
SELECT 
    name,
    SUM(number) as total_count,
    COUNT(DISTINCT year) as years_popular
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year >= 2000
GROUP BY name
ORDER BY total_count DESC
LIMIT 10
"""

# Execute query and convert to DataFrame
# (to_dataframe() needs the pandas and db-dtypes packages installed)
df = client.query(query).to_dataframe()
df

## Example 2: Query with Parameters

In [None]:
# Using parameterized queries for safety and reusability
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("min_year", "INT64", 2010),
        bigquery.ScalarQueryParameter("limit", "INT64", 20),
    ]
)

parameterized_query = """
SELECT 
    year,
    gender,
    SUM(number) as total_births
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE year >= @min_year
GROUP BY year, gender
ORDER BY year, gender
LIMIT @limit
"""

df_params = client.query(parameterized_query, job_config=job_config).to_dataframe()
df_params.head()

## Example 3: Working with Your Own Data

In [None]:
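# Template for querying your own datasets
# Update the dataset and table names below. Table names cannot be passed as
# query parameters, so they are formatted into the SQL; use only names you trust.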

your_query = """
SELECT 
    *
FROM `{project}.{dataset}.{table}`
LIMIT 100
""".format(
    project=PROJECT_ID,
    dataset='your_dataset_name',  # TODO: Update this
    table='your_table_name'        # TODO: Update this
)

# Uncomment to run:
# df_your_data = client.query(your_query).to_dataframe()
# df_your_data.head()
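
Because the table name above is formatted into the SQL string rather than passed as a parameter, it can help to validate each part first. The following is a minimal sketch, assuming an allowlist of letters, digits, underscores, and hyphens; `table_ref` is a hypothetical helper name, not part of the BigQuery client library, and the allowlist is deliberately stricter than what BigQuery itself accepts.

```python
import re

# Assumed allowlist: letters, digits, underscores, and hyphens only.
_SAFE_PART = re.compile(r"[A-Za-z0-9_-]+")


def table_ref(project: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted `project.dataset.table` reference.

    Raises ValueError if any part is empty or has characters outside the allowlist.
    """
    for label, value in (("project", project), ("dataset", dataset), ("table", table)):
        if not _SAFE_PART.fullmatch(value):
            raise ValueError(f"Unsafe {label} name: {value!r}")
    return f"`{project}.{dataset}.{table}`"


# Placeholder names, matching the template above
your_query = (
    f"SELECT * FROM {table_ref('your-project-id', 'your_dataset_name', 'your_table_name')} "
    "LIMIT 100"
)
```

A name containing a backtick, space, or semicolon raises `ValueError` instead of reaching the query string.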

## Example 4: Simple Visualization

In [None]:
# Create a simple visualization from the parameterized query results
if not df_params.empty:
    # Pivot the data for plotting
    pivot_df = df_params.pivot(index='year', columns='gender', values='total_births')
    
    # Create the plot
    fig, ax = plt.subplots(figsize=(10, 6))
    pivot_df.plot(kind='bar', ax=ax)
    
    ax.set_title('Total Births by Year and Gender', fontsize=16)
    ax.set_xlabel('Year', fontsize=12)
    ax.set_ylabel('Total Births', fontsize=12)
    ax.legend(title='Gender')
    
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

## Best Practices

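1. **Use parameterized queries** for values to prevent SQL injection; table and column names cannot be parameters, so format in only names you trust
2. **Limit query results** during development to keep output manageable; to control costs, select only the columns you need, since on-demand pricing bills for bytes scanned and `LIMIT` generally does not reduce them
3. **Cache results locally** when doing iterative analysis, so the same query is not rerun on every pass
4. **Use appropriate data types** in your queries, for example by matching each parameter's declared type (such as `INT64`) to the column it is compared against
5. **Monitor query costs** in the BigQuery console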
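
One way to check cost before running a query is a dry run, which asks BigQuery how many bytes the query would process without executing it. The sketch below assumes `client` is the `bigquery.Client` created earlier; `estimate_bytes` and `format_bytes` are hypothetical helper names introduced here for illustration.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count in binary units, capped at TiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.2f} {unit}"
        size /= 1024


def estimate_bytes(client, sql: str) -> int:
    """Dry-run `sql` and return the bytes it would process, without running it."""
    # Imported here so format_bytes stays usable without the client library.
    from google.cloud import bigquery

    # dry_run=True validates and estimates only; disabling the query cache
    # makes the estimate reflect a full, uncached run.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed
```

For example, `format_bytes(estimate_bytes(client, query))` would report the scan size of the Example 1 query before it runs.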

## Saving Results

In [None]:
# Save results to CSV (will be git-ignored)
# df.to_csv('../../data/query_results.csv', index=False)

# Save visualizations to your project folder
# fig.savefig('../outputs/births_by_gender.png', dpi=300, bbox_inches='tight')
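
Saved CSV files can also serve as a local cache for iterative analysis. The sketch below keys each file on a hash of the query text, so an unchanged query is read from disk instead of being rerun. It assumes a cache folder under `../../data/`; `cache_path` and `cached_query` are hypothetical helper names, and `run_query` is any callable that takes SQL and returns a DataFrame. Note that a CSV round trip does not preserve every dtype (dates come back as strings unless parsed).

```python
import hashlib
from pathlib import Path


def cache_path(sql: str, cache_dir: str = "../../data/cache") -> Path:
    """Map a query string to a stable CSV path based on its SHA-256 digest."""
    digest = hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"query_{digest}.csv"


def cached_query(sql: str, run_query, cache_dir: str = "../../data/cache"):
    """Return cached results for `sql` if present, else run it and cache the result."""
    # Imported here so cache_path stays usable without pandas installed.
    import pandas as pd

    path = cache_path(sql, cache_dir)
    if path.exists():
        return pd.read_csv(path)
    df = run_query(sql)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df
```

With the client from earlier, a call might look like `cached_query(query, lambda sql: client.query(sql).to_dataframe())`. Delete the cached file to force a fresh run.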

## Next Steps

- Set `PROJECT_ID` to your actual project ID
- Modify queries to work with your datasets
- Create more complex analyses
- Share findings via Google Drive