Reading data and making it accessible (often called data loading) is an essential first step in any data analysis workflow. <br>
The term parsing is also sometimes used to describe loading text data and interpreting it as tables with different data types. <br>

Input and output typically fall into a few main categories: 
- reading text files and other more efficient on-disk formats, 
- loading data from databases, and 
- interacting with network sources like web APIs.

## Reading and Writing Data in Text Format

Here’s a table listing the key functions in Pandas for loading text and binary data:

| **Function**                  | **Purpose**                                   | **Data Format**         |
|-------------------------------|-----------------------------------------------|-------------------------|
| `pd.read_csv()`          | Reads a comma-separated values (CSV) file.   | Text (CSV)             |
| `pd.read_table()`         | Reads a general delimited file.              | Text (Delimited)       |
| `pd.read_fwf()`          | Reads a fixed-width formatted file.          | Text (Fixed-width)     |
| `pd.read_json()`          | Reads a JSON file or JSON string.            | Text (JSON)            |
| `pd.read_html()`          | Reads tables from HTML content.              | Text (HTML)            |
| `pd.read_xml()`           | Reads XML data into a DataFrame.             | Text (XML)             |
| `pd.read_sql()`         | Reads from a SQL database (query or table).  | Database (SQL Query/Table) |
| `pd.read_sql_query()`     | Reads the results of a SQL query.            | Database (SQL Query)   |
| `pd.read_sql_table()`     | Reads a table from a SQL database.           | Database (SQL Table)   |
| `pd.read_excel()`         | Reads data from Excel files (.xls, .xlsx).   | Binary (Excel)         |
| `pd.read_parquet()`       | Reads Parquet format files.                  | Binary (Parquet)       |
| `pd.read_feather()`       | Reads Feather format files.                  | Binary (Feather)       |
| `pd.read_sas()`           | Reads SAS data files (.sas7bdat).            | Binary (SAS)           |
| `pd.read_stata()`         | Reads Stata files (.dta).                    | Binary (Stata)         |
| `pd.read_hdf()`          | Reads HDF5 format files.                     | Binary (HDF5)          |
| `pd.read_pickle()`        | Reads pickled object files.                  | Binary (Pickle)        |
| `pd.read_orc()`           | Reads ORC format files.                      | Binary (ORC)           |

Each function has various parameters to customize the data loading process, such as specifying separators, selecting columns, or handling missing data.
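As a small illustration of two readers from the table other than `read_csv`, the sketch below loads JSON and fixed-width text from in-memory strings. The sample data and variable names are made up for this example, and `io.StringIO` stands in for real files.

```python
import io

import pandas as pd

# Hypothetical JSON payload: a list of records becomes one row per record.
json_text = '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'
from_json = pd.read_json(io.StringIO(json_text))

# Hypothetical fixed-width text: columns are aligned by position, not by a delimiter.
fwf_text = (
    "id  name   score\n"
    "1   ana    91.5\n"
    "2   ben    78.0\n"
)
from_fwf = pd.read_fwf(io.StringIO(fwf_text))

print(from_json)
print(from_fwf)
```

Both calls return an ordinary DataFrame, so everything that follows for `read_csv` results applies to them as well.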

### Key Categories of optional arguments used with Pandas functions for loading text data into a DataFrame

1. Indexing
   - **Column selection**: Treat columns as index or infer names from the file.
   - **Header control**: Define where column names come from (file or manual).

2. Type Inference and Data Conversion
   - **Conversions**: Pass a mapping or function that converts values to specific types.
   - **Missing values**: Customize which values are treated as NaN (e.g., `NA`, `null`).
   
3. Date and Time Parsing
   - **Date combination**: Combine separate date and time columns.
   - **Custom format**: Specify custom date/time formats.

4. Iterating
   - **Chunked loading**: Process large files in chunks to avoid memory overload.

5. Unclean Data Issues
   - **Skip rows/columns**: Exclude irrelevant or unclean data like comments or headers.
   - **Handling numbers**: Manage numbers with thousand separators (e.g., commas).

### Summary:
Functions like `read_csv()` have many options for customizing data import, including type inference, date parsing, and handling large or messy data. Though the number of parameters can seem overwhelming, the Pandas documentation provides many examples for fine-tuning.
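The categories above can be sketched in a single call. This is only an illustration under stated assumptions: the data is a made-up in-memory file, and the column names (`date`, `city`, `sales`, `discount`) and the `"missing"` sentinel are hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory file; values are invented for illustration.
raw = io.StringIO(
    "date,city,sales,discount\n"
    '2024-01-05,Oslo,"1,200",10%\n'
    "2024-01-06,Lima,missing,5%\n"
    '2024-01-07,Pune,"3,450",0%\n'
)

df_opts = pd.read_csv(
    raw,
    index_col="date",        # indexing: use the date column as the row index
    parse_dates=True,        # date parsing: try to parse the index as datetimes
    thousands=",",           # unclean data: read "1,200" as the number 1200
    na_values=["missing"],   # missing values: treat "missing" as NaN
    # conversion: turn "10%" into the fraction 0.1
    converters={"discount": lambda s: float(s.rstrip("%")) / 100},
)

print(df_opts)
print(df_opts.dtypes)
```

Chunked loading, the remaining category, changes the return type of `read_csv` and is therefore easier to show separately.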

In [44]:
import numpy as np 
import pandas as pd 

In [45]:
# let's start with a small comma-separated values (CSV) text file

# since it is comma-delimited, we can use pandas.read_csv to read it into a DataFrame

df = pd.read_csv("examples/ex1.csv")
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [46]:
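# A file will not always have a header row. With header=None, pandas assigns default
# integer column names (here the file's own header line ends up as the first data row)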

pd.read_csv("examples/ex1.csv", header=None) # default column names

Unnamed: 0,0,1,2,3,4
0,a,b,c,d,message
1,1,2,3,4,hello
2,5,6,7,8,world
3,9,10,11,12,foo


In [47]:
pd.read_csv("examples/ex2.csv", names=['a', 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [48]:
# Suppose you want the message column to be the index of the returned DataFrame.
# Either pass its position (4) or its name ("message") to the index_col argument

names = ["a", "b", "c", "d", "message"]
pd.read_csv("examples/ex2.csv", names=names, index_col="message")

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [49]:
# To form a hierarchical index from multiple columns,
# pass a list of column numbers or names

parsed = pd.read_csv("examples/csv_mindex.csv", index_col=["key1", "key2"])
parsed

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [50]:
# In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields

result = pd.read_csv("examples/ex3.csv", sep=r"\s+")  # raw string avoids an invalid-escape warning
result

# Because there was one fewer column name than the number of columns in the data rows,
# pandas.read_csv infers that the first column should be the DataFrame's index in this special case


Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


In [51]:
# the file parsing functions have many additional arguments .. 

# to skip some rows of a file with skiprows:

pd.read_csv("examples/ex4.csv", skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [52]:
# Handling missing values is an important and frequently nuanced part of the file reading process
# Missing data is usually either not present (empty string) or marked by some sentinel (placeholder) value
# By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL

result = pd.read_csv("examples/ex5.csv")
result # two NaN values

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [53]:
pd.isna(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


In [54]:
# the na_values option accepts a sequence of strings to add to the default list of strings recognized as missing

result = pd.read_csv("examples/ex5.csv", na_values=["NULL"])
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [55]:
# the defaults can be disabled with the keep_default_na option

result2 = pd.read_csv("examples/ex5.csv", keep_default_na=False)
result2 

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [56]:
result2.isna() # no missing values

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [57]:
result3 = pd.read_csv("examples/ex5.csv", keep_default_na=False, na_values=["NA"])
result3

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [58]:
result3.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [59]:
# Different NA sentinels can be specified for each column in a dictionary:

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv("examples/ex5.csv", na_values=sentinels, keep_default_na=False)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


#### Some arguments of the pandas.read_csv function

| Argument        | Description                                                                                        |
|----------------------|---------------------------------------------------------------------------------------------------------|
| `filepath_or_buffer` | String indicating filesystem location, URL, or file-like object (the first positional argument). |
| `sep`            | Character sequence or regular expression to split fields in each row.                                    |
| `header`         | Row number to use as column names (default is 0). Set to `None` if there is no header row.               |
| `index_col`     | Column numbers or names to use as the row index; can be a single value or a list for a hierarchical index. |
| `names`          | List of column names to use for the result.                                                             |
| `skiprows`       | Number of rows at the beginning of the file to ignore or list of row numbers to skip.                   |
| `na_values`      | Sequence of values to replace with NA. Added to the default list unless `keep_default_na=False`.         |
| `keep_default_na`| Whether to use the default NA value list (`True` by default).                                            |
| `comment`        | Character(s) used to split comments off the end of lines.                                                |
| `parse_dates`   | Attempt to parse data as datetime. If `True`, attempts to parse the index; otherwise, pass a list of column numbers or names to parse. |
| `keep_date_col` | If joining columns for date parsing, keep the joined columns (default is `False`). Deprecated in recent pandas releases. |
| `converters`     | Dictionary of column numbers/names mapped to functions (e.g., `{"foo": f}` applies function `f` to column `foo`). |
| `dayfirst`       | For ambiguous dates, treat as international format (e.g., `7/6/2012` as June 7, 2012).                   |
| `date_parser`    | Function to use for parsing dates. Deprecated since pandas 2.0 in favor of `date_format`.                |
| `nrows`          | Number of rows to read from the beginning of the file (not counting header).                            |
| `iterator`       | Return a TextFileReader object for reading the file piecemeal, which can be used with a `with` statement. |
| `chunksize`      | For iteration, defines the size of file chunks.                                                         |
| `skipfooter`    | Number of lines to ignore at the end of the file (not supported by the C engine).                       |
| `verbose`        | Print various parsing details, such as the time spent and memory use information.                        |
| `encoding`       | Text encoding (e.g., `"utf-8"`). Defaults to `"utf-8"` if `None`.                                        |
| `squeeze`        | If the parsed data contains only one column, return it as a Series. Removed in pandas 2.0; call `.squeeze("columns")` on the result instead. |
| `thousands`      | Separator for thousands (e.g., `,` or `.`). Default is `None`.                                           |
| `decimal`        | Decimal separator in numbers (e.g., `.` or `,`). Default is `"."`.                                       |
| `engine`         | CSV parsing engine to use. Options: `"c"`, `"python"`, or `"pyarrow"`. Default is `"c"`.                |

These options provide a broad range of control over how CSV files are read, from handling missing values and date parsing to specifying which columns to use.
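A few of the less common arguments from the table can be tried on an in-memory file. The sketch below is an illustration only: the semicolon-separated data with decimal commas, a leading comment line, and a trailing total line is invented. Note that the keyword is spelled `skipfooter` in current pandas and that the example selects the Python engine for it.

```python
import io

import pandas as pd

# Hypothetical export: comment line on top, ";" as separator,
# "," as decimal mark, and a summary line at the bottom.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "id;price;qty\n"
    "1;3,50;2\n"
    "2;10,25;1\n"
    "3;0,99;7\n"
    "TOTAL;14,74;10\n"
)

df_args = pd.read_csv(
    raw,
    sep=";",          # field separator
    decimal=",",      # "3,50" is read as 3.5
    comment="#",      # the fully commented first line is ignored
    skipfooter=1,     # drop the trailing TOTAL line
    engine="python",  # skipfooter is handled by the Python engine
)

print(df_args)
print(df_args.dtypes)
```

Without `skipfooter=1`, the `TOTAL` row would force the `id` column to be read as strings instead of integers.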

## [ Reading Text Files in Pieces ]

In [60]:
# make the pandas display settings more compact

pd.options.display.max_rows = 10

result = pd.read_csv("examples/ex6.csv")
result

# the ellipsis marks ... indicate that rows in the middle of the DataFrame have been omitted

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


In [61]:
# to read only a small number of rows (avoiding reading the entire file) specify that with `nrows`
pd.read_csv("examples/ex6.csv", nrows=7)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.81748,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U


In [62]:
# to read a file in pieces, specify a chunksize as a number of rows
chunker = pd.read_csv("examples/ex6.csv", chunksize=1000)
type(chunker)

pandas.io.parsers.readers.TextFileReader

In [63]:
# The TextFileReader object returned by pandas.read_csv allows you to iterate over the parts of the file according to the chunksize

# file will be read in chunks instead of loading the entire file into memory
chunker = pd.read_csv("examples/ex6.csv", chunksize=1000)   
tot = pd.Series([], dtype='int64')      # initialized as an empty Series

# for each chunk, value_counts() method is called on the "key" column
for piece in chunker:   
    # adds the counts from the current chunk to the cumulative tot Series
    # fill_value=0 ensures that keys missing from either side are treated as 0 during addition
    tot = tot.add(piece["key"].value_counts(), fill_value=0)        
tot = tot.sort_values(ascending=False)

tot[:10]

key
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
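The chunked count above ends up as `float64` because `Series.add` with `fill_value` works on floats. One possible variation, sketched here on synthetic data rather than on `examples/ex6.csv`, collects the per-chunk counts and sums them in one step, which keeps integer counts. It also uses the reader as a context manager so the underlying file is closed when the loop ends.

```python
import io

import pandas as pd

# Synthetic stand-in for a large file: 2,500 rows with a repeating "key" column.
keys = "ABCDE"
text = "value,key\n" + "".join(f"{i},{keys[i % 5]}\n" for i in range(2500))

partial_counts = []
with pd.read_csv(io.StringIO(text), chunksize=1000) as reader:
    for piece in reader:
        # one small Series of counts per chunk; the chunk itself can then be discarded
        partial_counts.append(piece["key"].value_counts())

# combine the per-chunk counts: group by key label and sum
counts = pd.concat(partial_counts).groupby(level=0).sum().sort_values(ascending=False)
print(counts)
```

Only the small per-chunk count Series are kept in memory, so the approach scales the same way as the running total shown earlier.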

In [64]:
# get_chunk method lets you manually retrieve a specific-sized chunk of rows from the file,
# giving more control over how the file is processed

# use cases: 
# custom sized chunks
# iterative processing
# dynamic control 


reader = pd.read_csv("examples/ex6.csv", chunksize=1000)
# chunk1 = reader.get_chunk(500)  # Reads the first 500 rows
# chunk2 = reader.get_chunk(700)  # Reads the next 700 rows
# chunk3 = reader.get_chunk(300)  # Reads the next 300 rows

total_counts = pd.Series(dtype='int64')

chunk_size = [500,700,300]
for size in chunk_size:
    chunk = reader.get_chunk(size)  # read the next chunk of specified size
    total_counts = total_counts.add(chunk["key"].value_counts(), fill_value=0)
    total_counts = total_counts.sort_values(ascending=False)

total_counts[:7]

key
O    65.0
S    63.0
Q    60.0
I    60.0
R    56.0
F    55.0
X    55.0
dtype: float64

## [Writing Data to Text Format ]

Data can also be exported to a delimited format.

In [65]:
# example

data = pd.read_csv("examples/ex5.csv")
print(data) # dataframe

# write the DataFrame to a CSV file with the to_csv method
data.to_csv("examples/out.csv")

  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo


In [66]:
# sys.stdout prints the text result to the console rather than a file

import sys
data.to_csv(sys.stdout, sep="|")

# missing values will appear as empty strings in the output

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


In [67]:
# to denote missing values by some other sentinel value, pass na_rep

data.to_csv(sys.stdout, na_rep="NULL")

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


In [68]:
# With no other options specified, both the row and column labels are written. Both of these can be disabled

data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


In [69]:
# to write only a subset of the columns, and in an order of your choosing

data.to_csv(sys.stdout, index=False, columns=["a", "b", "c"])

a,b,c
1,2,3.0
5,6,
9,10,11.0
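The writing options can be combined and checked with a round trip. The following sketch writes a small, made-up DataFrame to an in-memory buffer instead of a file and reads it back. It relies on `"NULL"` being one of the default missing-value sentinels of `read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Small illustrative frame with one missing number and one missing string.
frame = pd.DataFrame(
    {
        "a": [1, 5, 9],
        "c": [3.0, np.nan, 11.0],
        "message": [np.nan, "world", "foo"],
    }
)

buffer = io.StringIO()
# tab-separated, no row labels, missing values written as NULL
frame.to_csv(buffer, sep="\t", index=False, na_rep="NULL")
text_out = buffer.getvalue()
print(text_out)

# read the text back; NULL is recognized as missing by default
restored = pd.read_csv(io.StringIO(text_out), sep="\t")
print(restored.isna())
```

Using the same `sep` on both sides is what makes the round trip work; the `na_rep` value only needs to be something the reader treats as missing.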


## [ Working with Other Delimited Formats ]
- For some files, manual processing may be necessary before the data can be loaded.
- A common case is a file with one or more malformed lines that trip up pandas.read_csv.

In [70]:
# CSV module
# For any file with a single-character delimiter, you can use Python's built-in csv module.
# To use it, pass any open file or file-like object to csv.reader

import csv

# the with block closes the file once iteration is done
with open("examples/ex7.csv") as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


In [74]:
# It's up to you to do the wrangling necessary to put the data in the form that you need.
# Let's take this step by step. First, we read the file into a list of lines

with open("examples/ex7.csv") as f:
    lines = list(csv.reader(f))

# split the lines into the header line and the data lines
header = lines[0]
values = lines[1:]

# build a dictionary of data columns from the header and zip(*values);
# zip(*values) transposes the rows into columns (one tuple per column)
data_dict = dict(zip(header, zip(*values)))
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
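For the malformed-line case mentioned at the start of this section, one possible manual approach is to keep only the rows whose field count matches the header and set the rest aside for inspection. This is a sketch on invented in-memory data, not a general recipe; the variable names are hypothetical.

```python
import csv
import io

import pandas as pd

# Hypothetical file content with two malformed lines.
raw = io.StringIO(
    "a,b,c\n"
    "1,2,3\n"
    "4,5\n"        # too few fields
    "6,7,8,9\n"    # too many fields
    "10,11,12\n"
)

rows = list(csv.reader(raw))
header, body = rows[0], rows[1:]

# separate well-formed rows from malformed ones by field count
good = [row for row in body if len(row) == len(header)]
bad = [row for row in body if len(row) != len(header)]

# csv.reader returns strings, so convert the columns explicitly
clean = pd.DataFrame(good, columns=header).astype(int)
print(clean)
print("skipped:", bad)
```

Keeping the rejected rows in `bad` makes it possible to report or repair them later instead of silently dropping data.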

#### CSV Dialect
- CSV files come in many different flavors. 
- To define a new format with a different delimiter, string quoting convention, or line terminator, we could define a simple subclass of `csv.Dialect`.
- `csv.Dialect` is a class in Python's `csv` module that defines a set of formatting rules for reading and writing CSV files.
- A dialect can be registered under a name with `csv.register_dialect` and reused for multiple CSV files.

- Some predefined dialects
    - `csv.excel`: uses `,` as a delimiter (default CSV format for Excel)
    - `csv.unix_dialect`: uses `,` with `\n` as line terminator and quotes all fields

###  `csv.Dialect` Attributes Table
Here’s a table of attributes that can be set when defining a `csv.Dialect` in Python:

| Attribute         | Description | Possible Values |
|------------------|-------------|----------------|
| `delimiter`      | One-character string used to separate fields | `,` in `csv.excel`; e.g., `;`, `\t`, `\|` |
| `quotechar`      | Character used to quote fields | `"` (double quote) in `csv.excel`; e.g., `'` (single quote) |
| `doublequote`    | Whether a `quotechar` inside a field is written as two consecutive quote characters | `True` (default), `False` (then `escapechar` is used as a prefix) |
| `escapechar`     | Character used for escaping special characters | Default: `None`, e.g., `\` (backslash) |
| `lineterminator` | String used to terminate lines when writing (the reader ignores it) | `\r\n` in `csv.excel`, `\n` in `csv.unix_dialect` |
| `quoting`        | Controls when quoting is used | `csv.QUOTE_MINIMAL` (default), `csv.QUOTE_ALL`, `csv.QUOTE_NONNUMERIC`, `csv.QUOTE_NONE` |
| `skipinitialspace` | Whether to skip spaces after delimiters | `True`, `False` (default) |
| `strict`         | Raises an error on bad CSV formatting | `True`, `False` (default) |

---


###  Common `quoting` Values
| Constant | Description |
|----------|------------|
| `csv.QUOTE_MINIMAL` | (Default) Only quote fields when necessary |
| `csv.QUOTE_ALL` | Quote **every** field |
| `csv.QUOTE_NONNUMERIC` | Quote **only non-numeric** fields |
| `csv.QUOTE_NONE` | **No** quoting (use `escapechar` instead) |


In [79]:
# demonstration



In [81]:
# For files with more complicated or fixed multicharacter delimiters, you will not be able to use the csv module. 
# In those cases, you'll have to do the line splitting and other cleanup using the string's split method or the regular expression method re.split.

# Thankfully pandas.read_csv is capable of doing almost anything you need if you pass the necessary options, so you only rarely will have to parse files by hand.
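As an illustration of the manual route, the sketch below splits lines on a multicharacter delimiter (`::` with inconsistent surrounding whitespace) using `re.split` semantics. The sample lines and column names are invented. In many such cases `pandas.read_csv` with a regular-expression `sep` may work as well, so this is only one option.

```python
import re

import pandas as pd

# Hypothetical lines using "::" as the delimiter, with uneven spacing.
lines = [
    "name :: score :: city",
    "ana::  91 ::Quito",
    "ben :: 78::  Hanoi",
]

# split on "::" together with any whitespace around it
pattern = re.compile(r"\s*::\s*")
split_rows = [pattern.split(line.strip()) for line in lines]

# first row is the header, the rest are data rows
table = pd.DataFrame(split_rows[1:], columns=split_rows[0])
table["score"] = table["score"].astype(int)  # split yields strings
print(table)
```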