# Part 15: SAS, SPSS Formats and Random Sampling in Pandas

In this notebook, we'll explore:
- Working with SAS and SPSS file formats
- Performance considerations in pandas
- Random sampling from Series and DataFrames

## Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import sqlite3
import os

## 1. SAS Formats

The top-level function `read_sas()` can read (but not write) SAS xport (.XPT) and SAS7BDAT (.sas7bdat) format files.

In [None]:
# Read a SAS7BDAT file (example - you would need an actual SAS file)
'''
df = pd.read_sas('sas_data.sas7bdat')
'''

In [None]:
# Obtain an iterator and read an XPORT file 100,000 rows at a time
'''
def do_something(chunk):
    pass

rdr = pd.read_sas('sas_xport.xpt', chunksize=100000)
for chunk in rdr:
    do_something(chunk)
'''

## 2. SPSS Formats

The top-level function `read_spss()` can read (but not write) SPSS sav (.sav) and zsav (.zsav) format files. It relies on the optional `pyreadstat` package, which must be installed separately.

In [None]:
# Read an SPSS file (example - you would need an actual SPSS file)
'''
df = pd.read_spss('spss_data.sav')
'''

In [None]:
# Extract a subset of columns and avoid converting categorical columns
'''
df = pd.read_spss('spss_data.sav', usecols=['foo', 'bar'],
                  convert_categoricals=False)
'''

## 3. Performance Considerations

Let's create a sample DataFrame for performance testing:

In [None]:
sz = 1000000
df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})

df.info()

In [None]:
# Example functions for testing IO performance
def test_sql_write(df):
    if os.path.exists('test.sql'):
        os.remove('test.sql')
    sql_db = sqlite3.connect('test.sql')
    df.to_sql(name='test_table', con=sql_db)
    sql_db.close()

def test_sql_read():
    sql_db = sqlite3.connect('test.sql')
    pd.read_sql_query("select * from test_table", sql_db)
    sql_db.close()

def test_hdf_fixed_write(df):
    # Pass the key by keyword; newer pandas versions no longer accept it positionally
    df.to_hdf('test_fixed.hdf', key='test', mode='w')
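
The functions above only perform the IO; they do not measure anything by themselves. In a notebook, the `%timeit` magic is the usual way to time them. As a minimal sketch outside of IPython, the SQL round trip could be timed with `time.perf_counter`. The helper names `time_call`, `sql_write`, and `sql_read`, the smaller 10,000-row frame, and the temporary directory are assumptions made for this illustration, not part of the original benchmark.

```python
import os
import sqlite3
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call of func (hypothetical helper)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def sql_write(df, path):
    # Write the frame to a SQLite file, replacing the table if it already exists
    con = sqlite3.connect(path)
    try:
        df.to_sql(name='test_table', con=con, if_exists='replace')
    finally:
        con.close()


def sql_read(path):
    # Read the whole table back into a DataFrame
    con = sqlite3.connect(path)
    try:
        return pd.read_sql_query("select * from test_table", con)
    finally:
        con.close()


# A smaller frame than the one above, so the sketch runs quickly
small_df = pd.DataFrame({'A': np.random.randn(10_000), 'B': [1] * 10_000})

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, 'test.sql')
    _, write_secs = time_call(sql_write, small_df, db_path)
    round_trip, read_secs = time_call(sql_read, db_path)

print(f"write: {write_secs:.3f}s, read: {read_secs:.3f}s")
```

A single call gives only a rough number; repeating the measurement several times and comparing the spread is more reliable.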

## 4. Selecting Random Samples

A random selection of rows or columns from a Series or DataFrame can be obtained with the `sample()` method. The method samples rows by default and accepts either a specific number of rows/columns to return (`n`) or a fraction of them (`frac`).

In [None]:
s = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1 row
s.sample()

In [None]:
# Specify a number of rows
s.sample(n=3)

In [None]:
# Or a fraction of the rows
s.sample(frac=0.5)
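
Because the calls above draw different rows on every run, it can help to fix the seed when results need to be repeatable. The sketch below uses the `random_state` argument of `sample()`; the seed values 42 and 0 are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# The same seed yields the same rows on every call
first = s.sample(n=3, random_state=42)
second = s.sample(n=3, random_state=42)
print(first.equals(second))

# frac=0.5 of a 6-element Series returns 3 rows
half = s.sample(frac=0.5, random_state=0)
print(len(half))
```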

### 4.1 Sampling with Replacement

By default, `sample` will return each row at most once, but one can also sample with replacement using the `replace` option:

In [None]:
s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement (default)
s.sample(n=6, replace=False)

In [None]:
# With replacement
s.sample(n=6, replace=True)
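
One practical consequence of the default: without replacement, a sample cannot be larger than the object itself. The following sketch shows the `ValueError` that pandas raises in that case, and how `replace=True` lifts the limit. The variable names and the seed are illustrative assumptions.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# Asking for 10 rows from a 6-row Series fails without replacement
try:
    s.sample(n=10)
    oversample_failed = False
except ValueError as exc:
    oversample_failed = True
    print(f"ValueError: {exc}")

# With replacement, the same request succeeds and rows repeat
bigger = s.sample(n=10, replace=True, random_state=1)
print(len(bigger), bigger.index.has_duplicates)
```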

### 4.2 Sampling with Weights

By default, each row has an equal probability of being selected. To give rows different probabilities, pass sampling weights to `sample` through the `weights` argument.

In [None]:
s = pd.Series([0, 1, 2, 3, 4, 5])

example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

s.sample(n=3, weights=example_weights)

In [None]:
# Weights that do not sum to 1 are re-normalized automatically
example_weights2 = [0.5, 0, 0, 0, 0, 0]

s.sample(n=1, weights=example_weights2)
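
To check how the weights behave, one option is to repeat the draw over many seeds and collect what comes back. In the sketch below, rows with a weight of zero should never appear, and the single non-zero weight in `example_weights2` should always win after re-normalization. The loop over seeds and the names `seen` and `only_choice` are assumptions for this illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

# Collect every value drawn across 50 differently seeded samples
seen = set()
for seed in range(50):
    seen.update(s.sample(n=3, weights=example_weights, random_state=seed))

# Values 0 and 1 carry zero weight, so they are never drawn
print(seen <= {2, 3, 4, 5})

# Only the first row has a non-zero weight, so it is always selected
example_weights2 = [0.5, 0, 0, 0, 0, 0]
only_choice = {
    s.sample(n=1, weights=example_weights2, random_state=seed).iloc[0]
    for seed in range(20)
}
print(only_choice)
```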

### 4.3 Using DataFrame Column as Weights

When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.

In [None]:
df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
                   'weight_column': [0.5, 0.4, 0.1, 0]})

df2.sample(n=3, weights='weight_column')
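
A rough way to see the column weights at work is to draw many rows with replacement and look at how often each one appears. This is a sketch under assumptions: the 1,000 draws, the seed, and the name `shares` are illustrative, and the observed shares only approximate the weights.

```python
import pandas as pd

df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
                    'weight_column': [0.5, 0.4, 0.1, 0]})

# Draw 1,000 rows with replacement, weighted by 'weight_column'
draws = df2.sample(n=1000, weights='weight_column',
                   replace=True, random_state=0)

# Share of draws per value of col1; the row with weight 0 never appears
shares = draws['col1'].value_counts(normalize=True)
print(shares)
```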

### 4.4 Sampling Columns

The `sample` method also allows users to sample columns instead of rows using the `axis` argument.

In [None]:
df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# Sample columns instead of rows
df3.sample(n=1, axis=1)
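
The `axis` argument combines with the other options shown earlier. As a small sketch, the example below picks one column reproducibly and then uses `frac=1` to return all columns in random order. The seed and the variable names are assumptions for illustration.

```python
import pandas as pd

df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# One randomly chosen column; all three rows are kept
one_col = df3.sample(n=1, axis=1, random_state=0)
print(one_col.shape)

# frac=1 along the column axis returns every column in shuffled order
shuffled = df3.sample(frac=1, axis='columns', random_state=0)
print(list(shuffled.columns))
```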