# Part 11: Filtering Columns, Comments and JSON in Pandas

In this notebook, we'll explore:
- Filtering columns using the `usecols` parameter
- Handling comments and empty lines
- Writing formatted strings with `to_string`
- Working with JSON data

## Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
from io import StringIO

## 1. Filtering Columns (usecols)

The `usecols` argument lets you read only a subset of the columns in a file. You can select columns by name, by position, or with a callable that receives each column name and returns `True` for the columns to keep.

In [None]:
data = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'

pd.read_csv(StringIO(data))

In [None]:
pd.read_csv(StringIO(data), usecols=['b', 'd'])

In [None]:
pd.read_csv(StringIO(data), usecols=[0, 2, 3])

In [None]:
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['A', 'C'])

A callable passed to `usecols` can also exclude columns, by returning `False` for the names you do not want in the result:

In [None]:
pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c'])
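One detail worth checking is that a list passed to `usecols` selects columns but does not set their order: the result follows the order in the file. The sketch below illustrates this, using the hypothetical names `csv_text`, `as_parsed` and `reordered`; reordering is done by indexing the result afterwards.

```python
from io import StringIO

import pandas as pd

csv_text = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'

# The list order ['d', 'b'] is ignored; columns come back in file order
as_parsed = pd.read_csv(StringIO(csv_text), usecols=['d', 'b'])
print(list(as_parsed.columns))

# To get a specific order, index the parsed DataFrame explicitly
reordered = as_parsed[['d', 'b']]
print(list(reordered.columns))
```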

## 2. Comments and Empty Lines

### 2.1 Ignoring Line Comments and Empty Lines

If the `comment` parameter is specified, then completely commented lines will be ignored. By default, completely blank lines will be ignored as well.

In [None]:
data = ('\n'
        'a,b,c\n'
        ' \n'
        '# commented line\n'
        '1,2,3\n'
        '\n'
        '4,5,6')

print(data)

In [None]:
pd.read_csv(StringIO(data), comment='#')
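The `comment` character also applies inside a line: everything from the character to the end of that line is not parsed. The following is a minimal sketch of both cases, with made-up sample text in the hypothetical variable `commented_text`.

```python
from io import StringIO

import pandas as pd

commented_text = ('a,b,c\n'
                  '# full-line comment\n'
                  '1,2,3# trailing comment\n'
                  '4,5,6')

# The full-line comment is dropped; the trailing comment is cut off
parsed = pd.read_csv(StringIO(commented_text), comment='#')
print(parsed)
```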

If `skip_blank_lines=False`, then `read_csv` will not ignore blank lines:

In [None]:
data = ('a,b,c\n'
        '\n'
        '1,2,3\n'
        '\n'
        '\n'
        '4,5,6')

pd.read_csv(StringIO(data), skip_blank_lines=False)
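To make the difference concrete, the sketch below parses the same text with and without `skip_blank_lines=False` and compares the row counts. The names `blank_text`, `skipped` and `kept` are only illustrative.

```python
from io import StringIO

import pandas as pd

blank_text = 'a,b,c\n\n1,2,3\n\n\n4,5,6'

# Default behaviour: the three blank lines are ignored
skipped = pd.read_csv(StringIO(blank_text))

# Each blank line becomes a row of NaN values
kept = pd.read_csv(StringIO(blank_text), skip_blank_lines=False)

print(len(skipped), len(kept))
print(kept.isna().all(axis=1).tolist())
```

Because the kept rows hold `NaN`, the columns in `kept` can no longer be stored as plain integers.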

### 2.2 Line Numbers and Headers with Comments

The presence of ignored lines might create ambiguities involving line numbers; the parameter `header` uses row numbers (ignoring commented/empty lines), while `skiprows` uses line numbers (including commented/empty lines):

In [None]:
data = ('#comment\n'
        'a,b,c\n'
        'A,B,C\n'
        '1,2,3')

pd.read_csv(StringIO(data), comment='#', header=1)

In [None]:
data = ('A,B,C\n'
        '#comment\n'
        'a,b,c\n'
        '1,2,3')

pd.read_csv(StringIO(data), comment='#', skiprows=2)
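The two counting schemes are easiest to compare on one input with the same number passed to each parameter. This is a small sketch with hypothetical sample text in `sample_text`: `header=1` counts parsed rows after the comment is dropped, while `skiprows=1` counts raw lines and so skips only the comment.

```python
from io import StringIO

import pandas as pd

sample_text = ('#comment\n'
               'A,B,C\n'
               'a,b,c\n'
               '1,2,3')

# Row numbers: the comment is ignored, so row 0 is 'A,B,C' and row 1 is 'a,b,c'
by_header = pd.read_csv(StringIO(sample_text), comment='#', header=1)

# Line numbers: line 0 is the comment, so 'A,B,C' becomes the header
by_skiprows = pd.read_csv(StringIO(sample_text), comment='#', skiprows=1)

print(list(by_header.columns), len(by_header))
print(list(by_skiprows.columns), len(by_skiprows))
```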

If both `header` and `skiprows` are specified, `header` will be relative to the end of `skiprows`:

In [None]:
data = ('# empty\n'
        '# second empty line\n'
        '# third emptyline\n'
        'X,Y,Z\n'
        '1,2,3\n'
        'A,B,C\n'
        '1,2.,4.\n'
        '5.,NaN,10.0\n')

print(data)

In [None]:
pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
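The call above can be traced step by step. The sketch below repeats it on the same text (held in the hypothetical variable `layered_text`) and checks which line ends up as the header.

```python
from io import StringIO

import pandas as pd

layered_text = ('# empty\n'
                '# second empty line\n'
                '# third emptyline\n'
                'X,Y,Z\n'
                '1,2,3\n'
                'A,B,C\n'
                '1,2.,4.\n'
                '5.,NaN,10.0\n')

# skiprows=4 drops lines 0-3 (three comments plus 'X,Y,Z').
# header=1 then counts from what is left: row 0 is '1,2,3', row 1 is 'A,B,C'.
combined = pd.read_csv(StringIO(layered_text), comment='#', skiprows=4, header=1)

print(list(combined.columns))
print(combined.shape)
```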

## 3. Writing Formatted Strings

The DataFrame object has an instance method `to_string` which allows control over the string representation of the object. All arguments are optional:
- `buf` default None, for example a StringIO object
- `columns` default None, which columns to write
- `col_space` default None, minimum width of each column
- `na_rep` default 'NaN', string representation of a missing (NA) value
- `formatters` default None, a dictionary (by column) of functions each of which takes a single argument and returns a formatted string
- `float_format` default None, a function which takes a single (float) argument and returns a formatted string; to be applied to floats in the DataFrame
- `sparsify` default True, set to False for a DataFrame with a hierarchical index to print every MultiIndex key at each row
- `index_names` default True, will print the names of the indices
- `index` default True, will print the index (i.e., the row labels)
- `header` default True, will print the column labels
- `justify` default None, how to justify the column headers (for example 'left' or 'right'); None falls back to the `display.colheader_justify` option

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [1.234, 5.678, 9.101],
    'C': ['foo', 'bar', 'baz']
})

# Default string representation
print(df.to_string())

In [None]:
# With float formatting
print(df.to_string(float_format=lambda x: f'{x:.1f}'))

In [None]:
# With column selection
print(df.to_string(columns=['A', 'C']))
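Several of the arguments listed above can be combined in one call. The following sketch uses a hypothetical frame `df_fmt` with one missing value to show `index`, `na_rep`, `formatters` and `buf` together.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical sample frame with one missing value in column B
df_fmt = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [1.234, np.nan, 9.101],
    'C': ['foo', 'bar', 'baz'],
})

# Hide the index, show missing values as '-', and upper-case column C
text = df_fmt.to_string(index=False, na_rep='-', formatters={'C': str.upper})
print(text)

# When buf is given, the text is written to the buffer and None is returned
buf = StringIO()
result = df_fmt.to_string(buf=buf, index=False)
print(result is None)
```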

The Series object also has a `to_string` method, with a smaller set of arguments that includes `buf`, `na_rep` and `float_format`. Its `length` argument (default False), if set to True, additionally outputs the length of the Series.

In [None]:
# Series to_string example
s = pd.Series([1, 2, 3, np.nan])
print(s.to_string())
print("\nWith length:")
print(s.to_string(length=True))

## 4. JSON

### 4.1 Writing JSON

A Series or DataFrame can be converted to a valid JSON string using `to_json` with optional parameters:
- `path_or_buf`: the pathname or buffer to write the output (None returns a JSON string)
- `orient`: format of the JSON string
  - Series: default is 'index', allowed values are {'split', 'records', 'index', 'table'}
  - DataFrame: default is 'columns', allowed values are {'split', 'records', 'index', 'columns', 'values', 'table'}
- `date_format`: type of date conversion, 'epoch' for an epoch timestamp or 'iso' for an ISO 8601 string
- `double_precision`: decimal places for floating point values, default 10
- `force_ascii`: force encoded string to be ASCII, default True
- `date_unit`: time unit to encode to ('s', 'ms', 'us', 'ns'), default 'ms'
- `default_handler`: handler for objects that can't be converted to JSON
- `lines`: if `orient='records'`, write one JSON record per line (line-delimited JSON)

In [None]:
# DataFrame to JSON examples
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['foo', 'bar', 'baz'],
    'C': pd.date_range('2021-01-01', periods=3)
})

# Default orientation (columns)
print(df.to_json())

In [None]:
# Records orientation
print(df.to_json(orient='records'))

In [None]:
# Split orientation
print(df.to_json(orient='split'))
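The `orient` values differ only in how the same data is nested. As a rough comparison, the sketch below serialises a small hypothetical frame (`small`) with several orients and parses each result back with the standard-library `json` module so the shapes can be inspected.

```python
import json

import pandas as pd

small = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# Parse each JSON string back into Python objects to compare the layouts
layouts = {
    orient: json.loads(small.to_json(orient=orient))
    for orient in ['columns', 'index', 'records', 'values', 'split']
}

for orient, parsed in layouts.items():
    print(f'{orient:8} {parsed}')
```

Note that 'columns' and 'index' turn the row labels into string keys, while 'records' and 'values' drop them entirely.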

In [None]:
# Series to JSON
s = pd.Series([1, 2, 3])
print(s.to_json())
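The `lines` and `date_format` parameters from the list above are not exercised by the earlier cells. A minimal sketch, assuming a hypothetical frame `events` with one datetime column, could look like this; each output line is parsed with the standard-library `json` module to confirm it is a complete record.

```python
import json

import pandas as pd

events = pd.DataFrame({
    'A': [1, 2, 3],
    'C': pd.date_range('2021-01-01', periods=3),
})

# One JSON object per line, with dates written as ISO 8601 strings
ndjson = events.to_json(orient='records', lines=True, date_format='iso')
print(ndjson)

# Each non-empty line is a standalone JSON document
records = [json.loads(line) for line in ndjson.strip().splitlines()]
print(records[0])
```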