Reading data and making it accessible (often called data loading) is a very important first step in data analysis. <br>
The term parsing is also sometimes used to describe loading text data and interpreting it as tables and different data types. <br>

Input and output typically fall into a few main categories: 
- reading text files and other more efficient on-disk formats, 
- loading data from databases, and 
- interacting with network sources like web APIs.

## Reading and Writing Data in Text Format

Here’s a table listing the key functions in Pandas for loading text and binary data:

| **Function**                  | **Purpose**                                   | **Data Format**         |
|-------------------------------|-----------------------------------------------|-------------------------|
| `pd.read_csv()`          | Reads a comma-separated values (CSV) file.   | Text (CSV)             |
| `pd.read_table()`         | Reads a general delimited file.              | Text (Delimited)       |
| `pd.read_fwf()`          | Reads a fixed-width formatted file.          | Text (Fixed-width)     |
| `pd.read_json()`          | Reads a JSON file or JSON string.            | Text (JSON)            |
| `pd.read_html()`          | Reads tables from HTML content.              | Text (HTML)            |
| `pd.read_xml()`           | Reads XML data into a DataFrame.             | Text (XML)             |
| `pd.read_sql()`         | Reads a SQL query or a database table.       | Database (SQL query or table) |
| `pd.read_sql_query()`     | Reads the results of a SQL query.            | Database (SQL query)   |
| `pd.read_sql_table()`     | Reads a table from a SQL database.           | Database (SQL table)   |
| `pd.read_excel()`         | Reads data from Excel files (.xls, .xlsx).   | Binary (Excel)         |
| `pd.read_parquet()`       | Reads Parquet format files.                  | Binary (Parquet)       |
| `pd.read_feather()`       | Reads Feather format files.                  | Binary (Feather)       |
| `pd.read_sas()`           | Reads SAS data files (.sas7bdat).            | Binary (SAS)           |
| `pd.read_stata()`         | Reads Stata files (.dta).                    | Binary (Stata)         |
| `pd.read_hdf()`          | Reads HDF5 format files.                     | Binary (HDF5)          |
| `pd.read_pickle()`        | Reads pickled object files.                  | Binary (Pickle)        |
| `pd.read_orc()`           | Reads ORC format files.                      | Binary (ORC)           |
| `pd.read_sql()` with a `sqlite3` connection | Reads from SQLite database files (pandas has no `read_sqlite()` function). | Binary (SQLite)        |

Each function has various parameters to customize the data loading process, such as specifying separators, columns, or handling missing data.
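
To show that the loaders in the table share the same shape (a source goes in, a DataFrame comes out), here is a minimal sketch using two of them. The table name `scores`, its columns, and the JSON text are made up for illustration; the in-memory SQLite database and the `StringIO` object stand in for a real database and a real JSON file.

```python
import io
import sqlite3

import pandas as pd

# Build a throwaway in-memory SQLite database (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, points INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("ann", 3), ("bob", 5)])
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame.
sql_df = pd.read_sql("SELECT name, points FROM scores ORDER BY name", conn)
conn.close()

# read_json accepts a file-like object; StringIO stands in for a JSON file.
json_text = '[{"name": "ann", "points": 3}, {"name": "bob", "points": 5}]'
json_df = pd.read_json(io.StringIO(json_text))

print(sql_df)
print(json_df)
```

Both calls should produce a two-row DataFrame with `name` and `points` columns.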

### Key Categories of optional arguments used with Pandas functions for loading text data into a DataFrame

1. Indexing
   - **Column selection**: Treat columns as index or infer names from the file.
   - **Header control**: Define where column names come from (file or manual).

2. Type Inference and Data Conversion
   - **Conversions**: Supply a dictionary or function that converts values to specific types.
   - **Missing values**: Customize which values are treated as NaN (e.g., `NA`, `null`).
   
3. Date and Time Parsing
   - **Date combination**: Combine separate date and time columns.
   - **Custom format**: Specify custom date/time formats.

4. Iterating
   - **Chunked loading**: Process large files in chunks to avoid memory overload.

5. Unclean Data Issues
   - **Skip rows/columns**: Exclude irrelevant or unclean data like comments or headers.
   - **Handling numbers**: Manage numbers with thousand separators (e.g., commas).
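
The categories above can be sketched in a single `read_csv` call. This is only an illustration: the inline text, the column names, and the `missing` marker are invented, and `StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical file contents: a comment line, a date column,
# quoted numbers with thousands separators, and a custom missing marker.
raw = io.StringIO(
    "# exported from a made-up system\n"
    "day,city,population,rating\n"
    '2024-01-01,Oslo,"709,000",4.5\n'
    '2024-01-02,Bergen,"291,000",missing\n'
)

df = pd.read_csv(
    raw,
    comment="#",            # unclean data: ignore comment lines
    index_col="city",       # indexing: use a column as the row index
    parse_dates=["day"],    # date parsing: convert the column to datetimes
    thousands=",",          # numbers: read "709,000" as 709000
    na_values=["missing"],  # missing values: treat an extra marker as NaN
)

print(df)
print(df.dtypes)
```

Each keyword maps to one category: `index_col` for indexing, `na_values` for missing values, `parse_dates` for dates, and `comment` and `thousands` for unclean data.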

### Summary:
Functions like `read_csv()` have many options for customizing data import, including type inference, date parsing, and handling large or messy data. Though the number of parameters can seem overwhelming, the Pandas documentation provides many examples for fine-tuning.
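
As one example of handling large data, the sketch below reads a file in chunks and aggregates as it goes. The ten-row inline text is an assumption that stands in for a file too large to load at once, and the `key` and `value` column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data: 10 rows standing in for a much larger file.
raw = io.StringIO("key,value\n" + "".join(f"k{i % 2},{i}\n" for i in range(10)))

totals = {}
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single DataFrame.
with pd.read_csv(raw, chunksize=4) as reader:
    for chunk in reader:
        # Aggregate each chunk, then fold it into the running totals.
        for key, subtotal in chunk.groupby("key")["value"].sum().items():
            totals[key] = totals.get(key, 0) + int(subtotal)

print(totals)
```

Only one chunk (here, at most four rows) is held in memory at a time, which is the point of chunked loading.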

In [1]:
import numpy as np 
import pandas as pd 

In [4]:
# let's start with a small comma-separated values (CSV) text file

# since it is comma-delimited, we can use pandas.read_csv to read it into a DataFrame

df = pd.read_csv("examples/ex1.csv")
df

|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |


In [3]:
# file will not always have a header row

pd.read_csv("examples/ex2.csv", header=None) # default column names

|   | 0 | 1  | 2  | 3  | 4     |
|---|---|----|----|----|-------|
| 0 | 1 | 2  | 3  | 4  | hello |
| 1 | 5 | 6  | 7  | 8  | world |
| 2 | 9 | 10 | 11 | 12 | foo   |


In [5]:
pd.read_csv("examples/ex2.csv", names=['a', 'b', 'c', 'd', 'message'])

|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |


In [7]:
# suppose you want the message column to be the index of the returned DataFrame
# either pass the column's position (4) or its name ("message") to the index_col argument

names = ["a", "b", "c", "d", "message"]
pd.read_csv("examples/ex2.csv", names=names, index_col="message")

| message | a | b  | c  | d  |
|---------|---|----|----|----|
| hello   | 1 | 2  | 3  | 4  |
| world   | 5 | 6  | 7  | 8  |
| foo     | 9 | 10 | 11 | 12 |
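
A self-contained sketch of the `header`, `names`, and `index_col` options follows. The inline text is an assumption: it mirrors the rows shown in the outputs above so the example runs without the `examples/ex2.csv` file.

```python
import io

import pandas as pd

# Inline stand-in for a headerless CSV file (assumed contents).
text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

# header=None: pandas assigns integer column names 0..4.
default_names = pd.read_csv(io.StringIO(text), header=None)

# names + index_col: supply column names and use one of them as the index.
by_name = pd.read_csv(io.StringIO(text), names=names, index_col="message")

# index_col also accepts the column's position.
by_position = pd.read_csv(io.StringIO(text), names=names, index_col=4)

print(default_names.columns.tolist())
print(by_name)
```

Passing the name or the position selects the same column, so `by_name` and `by_position` hold the same data.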


In [8]:
# to form a hierarchical index (covered later) from multiple columns,
# pass a list of column numbers or names

parsed = pd.read_csv("examples/csv_mindex.csv", index_col=["key1", "key2"])
parsed

| key1 | key2 | value1 | value2 |
|------|------|--------|--------|
| one  | a    | 1      | 2      |
| one  | b    | 3      | 4      |
| one  | c    | 5      | 6      |
| one  | d    | 7      | 8      |
| two  | a    | 9      | 10     |
| two  | b    | 11     | 12     |
| two  | c    | 13     | 14     |
| two  | d    | 15     | 16     |
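
To make the hierarchical index concrete, here is a small sketch of how the resulting DataFrame can be queried. The inline text is an assumed, shortened stand-in for `examples/csv_mindex.csv`.

```python
import io

import pandas as pd

# Shortened inline stand-in for the file (assumed contents).
text = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,9,10\n"
    "two,b,11,12\n"
)

# A list of column names builds a two-level (hierarchical) index.
parsed = pd.read_csv(io.StringIO(text), index_col=["key1", "key2"])

# Column positions work as well as names.
same = pd.read_csv(io.StringIO(text), index_col=[0, 1])

# Select every row under the outer key "one"; the result is indexed by key2.
outer = parsed.loc["one"]

# Select a single cell with a (key1, key2) tuple.
cell = parsed.loc[("two", "b"), "value2"]

print(outer)
print(cell)
```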


In [11]:
# In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields

# Because the header row has one fewer field than the data rows,
# pandas.read_csv infers that the first column should be the DataFrame’s index in this special case

# a raw string avoids Python's "invalid escape sequence" warning for the regular expression
result = pd.read_csv("examples/ex3.csv", sep=r"\s+")
result


|     | A         | B         | C         |
|-----|-----------|-----------|-----------|
| aaa | -0.264438 | -1.026059 | -0.6195   |
| bbb | 0.927272  | 0.302904  | -0.032399 |
| ccc | -0.264273 | -0.386314 | -0.217601 |
| ddd | -0.871858 | -0.348382 | 1.100491  |
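
The whitespace case can be reproduced without the file. In this sketch the inline text is an assumption modeled on the first two rows of the output above; the header line deliberately has three names while each data row has four fields.

```python
import io

import pandas as pd

# Assumed layout: fields separated by a variable amount of whitespace,
# with no name for the first column.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r"\s+" is a regular expression matching one or more whitespace characters.
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(result)
print(result.index.tolist())
```

Because the header has one fewer field than the data rows, the first field of each row becomes the index.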


In [12]:
# the file parsing functions have many additional arguments

# for example, skiprows skips the listed rows of a file (numbered from 0):

pd.read_csv("examples/ex4.csv", skiprows=[0,2,3])

|   | a | b  | c  | d  | message |
|---|---|----|----|----|---------|
| 0 | 1 | 2  | 3  | 4  | hello   |
| 1 | 5 | 6  | 7  | 8  | world   |
| 2 | 9 | 10 | 11 | 12 | foo     |
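
Here is a sketch of two ways to drop unwanted lines. The inline text is an assumption about what a file like `examples/ex4.csv` might contain: comment lines at positions 0, 2, and 3 around the header and data.

```python
import io

import pandas as pd

# Assumed file contents: comment lines mixed in with the header and data.
text = (
    "# a note before the header\n"
    "a,b,c,d,message\n"
    "# a note after the header\n"
    "# another note, with a comma in it\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
)

# Option 1: skip lines by position (0-based line numbers).
by_rows = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])

# Option 2: skip every line that starts with the comment character.
by_comment = pd.read_csv(io.StringIO(text), comment="#")

print(by_rows)
```

`skiprows` needs the line positions to be known in advance, while `comment` works wherever the marker appears, so the latter may be more convenient when files are consistently annotated.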


In [13]:
# Handling missing values is an important and frequently nuanced part of the file reading process
# Missing data is usually either not present (empty string) or marked by some sentinel (placeholder) value
# By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL

result = pd.read_csv("examples/ex5.csv")
result # two NaN values

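|   | something | a | b  | c    | d  | message |
|---|-----------|---|----|------|----|---------|
| 0 | one       | 1 | 2  | 3.0  | 4  | NaN     |
| 1 | two       | 5 | 6  | NaN  | 8  | world   |
| 2 | three     | 9 | 10 | 11.0 | 12 | foo     |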


In [14]:
pd.isna(result)

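|   | something | a     | b     | c     | d     | message |
|---|-----------|-------|-------|-------|-------|---------|
| 0 | False     | False | False | False | False | True    |
| 1 | False     | False | False | True  | False | False   |
| 2 | False     | False | False | False | False | False   |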


In [15]:
# the na_values option accepts a sequence of strings to add to the default list of strings recognized as missing

result = pd.read_csv("examples/ex5.csv", na_values=["NULL"])
result

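|   | something | a | b  | c    | d  | message |
|---|-----------|---|----|------|----|---------|
| 0 | one       | 1 | 2  | 3.0  | 4  | NaN     |
| 1 | two       | 5 | 6  | NaN  | 8  | world   |
| 2 | three     | 9 | 10 | 11.0 | 12 | foo     |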


In [16]:
# the defaults can be disabled with the keep_default_na option

result2 = pd.read_csv("examples/ex5.csv", keep_default_na=False)
result2 

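|   | something | a | b  | c    | d  | message |
|---|-----------|---|----|------|----|---------|
| 0 | one       | 1 | 2  | 3.0  | 4  | NA      |
| 1 | two       | 5 | 6  |      | 8  | world   |
| 2 | three     | 9 | 10 | 11.0 | 12 | foo     |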


In [17]:
result2.isna() # no missing values

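|   | something | a     | b     | c     | d     | message |
|---|-----------|-------|-------|-------|-------|---------|
| 0 | False     | False | False | False | False | False   |
| 1 | False     | False | False | False | False | False   |
| 2 | False     | False | False | False | False | False   |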


In [18]:
result3 = pd.read_csv("examples/ex5.csv", keep_default_na=False, na_values=["NA"])
result3

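|   | something | a | b  | c    | d  | message |
|---|-----------|---|----|------|----|---------|
| 0 | one       | 1 | 2  | 3.0  | 4  | NaN     |
| 1 | two       | 5 | 6  |      | 8  | world   |
| 2 | three     | 9 | 10 | 11.0 | 12 | foo     |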


In [19]:
result3.isna()

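|   | something | a     | b     | c     | d     | message |
|---|-----------|-------|-------|-------|-------|---------|
| 0 | False     | False | False | False | False | True    |
| 1 | False     | False | False | False | False | False   |
| 2 | False     | False | False | False | False | False   |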


In [20]:
# Different NA sentinels can be specified for each column in a dictionary:

sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv("examples/ex5.csv", na_values=sentinels, keep_default_na=False)

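|   | something | a | b  | c    | d  | message |
|---|-----------|---|----|------|----|---------|
| 0 | one       | 1 | 2  | 3.0  | 4  | NaN     |
| 1 | NaN       | 5 | 6  |      | 8  | world   |
| 2 | three     | 9 | 10 | 11.0 | 12 | NaN     |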
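
The interaction between `na_values` and `keep_default_na` can be checked by counting missing cells. This sketch assumes file contents consistent with the outputs above: a literal `NA` in the `message` column and an empty field in column `c`.

```python
import io

import pandas as pd

# Assumed file contents, modeled on the outputs shown above.
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default sentinels: both "NA" and the empty field become NaN.
default = pd.read_csv(io.StringIO(text))

# Defaults disabled: nothing is treated as missing.
no_defaults = pd.read_csv(io.StringIO(text), keep_default_na=False)

# Defaults disabled, with per-column sentinels supplied in a dictionary.
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
per_column = pd.read_csv(
    io.StringIO(text), na_values=sentinels, keep_default_na=False
)

for label, frame in [
    ("default", default),
    ("no defaults", no_defaults),
    ("per column", per_column),
]:
    print(label, int(frame.isna().sum().sum()))
```

With the defaults disabled, the empty field in column `c` is kept as an empty string rather than NaN, which is why only the cells named in `sentinels` are missing in the last case.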


#### List of common options in pandas.read_csv

| Argument        | Description                                                                                        |
|----------------------|---------------------------------------------------------------------------------------------------------|
| `filepath_or_buffer` | String indicating filesystem location, URL, or file-like object (the first positional argument).     |
| `sep`            | Character sequence or regular expression to split fields in each row (default is `","`).                 |
| `header`         | Row number to use as column names (default is 0). Set to `None` if there is no header row.               |
| `index_col`     | Column numbers or names to use as the row index; can be a single value or a list for a hierarchical index. |
| `names`          | List of column names to use for the result.                                                             |
| `skiprows`       | Number of rows at the beginning of the file to ignore or list of row numbers (starting from 0) to skip.  |
| `na_values`      | Sequence of values to replace with NA, or a dictionary mapping column names to such sequences. Added to the default list unless `keep_default_na=False`. |
| `keep_default_na`| Whether to use the default NA value list (`True` by default).                                            |
| `comment`        | Single character marking the rest of a line as a comment to be ignored.                                  |
| `parse_dates`   | Attempt to parse data as datetime. If `True`, tries to parse the index; otherwise, pass a list of column numbers or names to parse. |
| `keep_date_col` | If joining columns for date parsing, keep the joined columns (default is `False`). Deprecated in recent pandas releases. |
| `converters`     | Dictionary of column numbers/names mapped to functions (e.g., `{"foo": f}` applies function `f` to column `foo`). |
| `dayfirst`       | For ambiguous dates, treat as international format (e.g., `7/6/2012` as June 7, 2012).                   |
| `date_parser`    | Function to use for parsing dates. Deprecated since pandas 2.0 in favor of `date_format`.                |
| `nrows`          | Number of rows to read from the beginning of the file (not counting header).                            |
| `iterator`       | Return a TextFileReader object for reading the file piecemeal, which can be used with a `with` statement. |
| `chunksize`      | For iteration, the number of rows in each chunk.                                                        |
| `skipfooter`    | Number of lines to ignore at the end of the file (not supported by the C engine).                       |
| `verbose`        | Print various parsing details, such as the time spent and memory use information. Deprecated in recent pandas releases. |
| `encoding`       | Text encoding (e.g., `"utf-8"`). Defaults to `"utf-8"` if `None`.                                        |
| `squeeze`        | If the parsed data contains only one column, return it as a Series. Removed in pandas 2.0; call `.squeeze("columns")` on the result instead. |
| `thousands`      | Separator for thousands (e.g., `,` or `.`). Default is `None`.                                           |
| `decimal`        | Decimal separator in numbers (e.g., `.` or `,`). Default is `"."`.                                       |
| `engine`         | CSV parsing engine to use. Options: `"c"`, `"python"`, or `"pyarrow"`. The C engine is used by default. |

These options provide a broad range of control over how CSV files are read, from handling missing values and date parsing to specifying which columns to use.
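
A short sketch combining several options from the table follows. The inline text, its `;` separator, and the `when` and `amount` column names are made up for illustration; `engine="python"` is chosen because `skipfooter` is not supported by the C engine.

```python
import io

import pandas as pd

# Hypothetical European-style export: ";" separates fields, "," is the
# decimal mark, dates are day-first, and the last line is a summary row.
text = (
    "when;amount\n"
    "7/6/2012;1,5\n"
    "8/6/2012;2,25\n"
    "9/6/2012;3,0\n"
    "total;6,75\n"
)

df = pd.read_csv(
    io.StringIO(text),
    sep=";",
    decimal=",",           # read "1,5" as 1.5
    parse_dates=["when"],  # convert the column to datetimes
    dayfirst=True,         # read 7/6/2012 as June 7, 2012
    skipfooter=1,          # drop the trailing summary line
    engine="python",       # skipfooter needs the Python engine
)

# nrows limits how many data rows are read (the header is not counted).
first_two = pd.read_csv(io.StringIO(text), sep=";", decimal=",", nrows=2)

print(df)
print(first_two)
```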