# Data Loading, Storage, and File Formats

*Data loading* refers to reading data and making it accessible. *Parsing* is also used to describe loading text data and interpreting it as tables and different data types.
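As a minimal sketch of what parsing means in practice, the snippet below reads a small piece of text and lets pandas infer a table and column types from it. The text content and the names `raw_text` and `df` are made up for illustration; `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical in-memory text standing in for a file on disk
raw_text = "name,age,height\nAna,31,1.68\nLuis,45,1.80\n"

# read_csv splits the text into rows and columns and infers a type per column
df = pd.read_csv(io.StringIO(raw_text))

# Inspect the inferred types: age is parsed as an integer column, height as a float column
print(df.dtypes)
```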

## Index

* [Reading and Writing Data in Text Format](#reading-and-writing-data-in-text-format)
    * [Some pandas.read_csv function arguments](#some-pandasread_csv-function-arguments)


## Reading and Writing Data in Text Format

*Text and binary data loading functions in pandas*
|Function|Description|
|---|---|
|**read_csv** |**Load delimited data from a file, URL, or file-like object; use comma as default delimiter**|
|read_fwf |Read data in fixed-width column format (i.e., no delimiters)|
|read_clipboard |Variation of read_csv that reads data from the clipboard; useful for converting tables from web pages|
|read_excel |Read tabular data from an Excel XLS or XLSX file|
|read_hdf |Read HDF5 files written by pandas|
|read_html |Read all tables found in the given HTML document|
|**read_json** |**Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object**|
|read_feather |Read the Feather binary file format|
|read_orc |Read the Apache ORC binary file format|
|read_parquet |Read the Apache Parquet binary file format|
|read_pickle |Read an object stored by pandas using the Python pickle format|
|read_sas |Read a SAS dataset stored in one of the SAS system’s custom storage formats|
|read_spss |Read a data file created by SPSS|
|read_sql |Read the results of a SQL query (using SQLAlchemy)|
|read_sql_table |Read a whole SQL table (using SQLAlchemy); equivalent to using a query that selects everything in that table using read_sql|
|read_stata |Read a dataset from Stata file format|
|read_xml |Read a table of data from an XML file|

Some of these functions have a long list of optional arguments (`pandas.read_csv()` has around 50), so if you are struggling to read a particular file, you can look online to find the arguments that fit your case.

```python
import pandas as pd

# Some examples:

# The CSV file has no header row, so pandas assigns default column names
pd.read_csv("example.csv", header=None)
# Or you can specify the names yourself
pd.read_csv("example.csv", names=["a", "b", "c", "d", "message"],
            index_col="message")
# The argument index_col="message" makes the "message" column the index

# Pass a list of column names to index_col for a hierarchical index

# You can pass a regular expression as the delimiter;
# use sep=r"\s+" if the fields are separated by a variable amount of whitespace

# With skiprows=[2, 3, 5] you can skip those rows (row numbers start at 0)
pd.read_csv("example.csv", skiprows=[2, 3, 5])
```
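The hierarchical index and regular-expression delimiter mentioned in the comments above can be sketched as follows. The data, the column names `key1`, `key2`, `value1`, `value2`, and the variable names are assumptions made for this example, and `io.StringIO` replaces a real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical data whose fields are separated by a variable amount of whitespace
ws_text = (
    "key1 key2   value1  value2\n"
    "one  a      1       2\n"
    "one  b      3       4\n"
    "two  a      5       6\n"
)

# sep=r"\s+" splits on any run of whitespace;
# a list passed to index_col builds a hierarchical (multi-level) index
frame = pd.read_csv(
    io.StringIO(ws_text),
    sep=r"\s+",
    index_col=["key1", "key2"],
)

# Select a single value by giving both index levels
print(frame.loc[("one", "b"), "value1"])
```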

#### *Some pandas.read_csv function arguments*
|Argument|Description|
|---|---|
|path |String indicating filesystem location, URL, or file-like object.|
|sep or delimiter |Character sequence or regular expression to use to split fields in each row.|
|header |Row number to use as column names; defaults to 0 (first row), but should be None if there is no header row.|
|index_col |Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index.|
|names |List of column names for result.|
|skiprows| Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip.|
|na_values |Sequence of values to replace with NA. They are added to the default list unless keep_default_na=False is passed.|
|keep_default_na |Whether to use the default NA value list or not (True by default).|
|comment| Character(s) to split comments off the end of lines.|
|parse_dates |Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If element of list is tuple or list, will combine multiple columns together and parse to date (e.g., if date/time split across two columns).|
|keep_date_col |If joining columns to parse date, keep the joined columns; False by default.|
|converters |Dictionary containing column number or name mapping to functions (e.g., {"foo": f} would apply the function f to all values in the "foo" column).|
|dayfirst |When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); False by default.|
|date_parser |Function to use to parse dates.|
|nrows |Number of rows to read from beginning of file (not counting the header).|
|iterator |Return a TextFileReader object for reading the file piecemeal. This object can also be used with the with statement.|
|chunksize |For iteration, size of file chunks.|
|skipfooter |Number of lines to ignore at end of file.|
|verbose |Print various parsing information, like the time spent in each stage of the file conversion and memory use information.|
|encoding |Text encoding (e.g., "utf-8" for UTF-8 encoded text). Defaults to "utf-8" if None.|
|squeeze |If the parsed data contains only one column, return a Series.|
|thousands |Separator for thousands (e.g., "," or "."); default is None.|
|decimal| Decimal separator in numbers (e.g., "." or ","); default is ".".|
|engine |CSV parsing and conversion engine to use; can be one of "c", "python", or "pyarrow". The default is "c", though the newer "pyarrow" engine can parse some files much faster. The "python" engine is slower but supports some features that the other engines do not.|
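
A few of the arguments from the table can be tried together in a short sketch. The CSV text, the `"missing"` sentinel, and the variable names below are invented for illustration; the example only assumes that `na_values`, `nrows`, and `chunksize` behave as described in the table.

```python
import io

import pandas as pd

# Hypothetical CSV text; "missing" is a made-up placeholder for absent values
csv_text = (
    "id,amount,status\n"
    "1,10.5,ok\n"
    "2,missing,ok\n"
    "3,7.25,failed\n"
    "4,3.0,ok\n"
)

# na_values: treat the string "missing" as NA, in addition to the default NA strings
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

# nrows: read only the first two data rows (the header is not counted)
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# chunksize: get a reader that yields the file two rows at a time;
# using it in a with statement closes the reader when the block ends
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    chunk_sizes = [len(chunk) for chunk in reader]

print(df["amount"].isna().sum(), len(first_two), chunk_sizes)
```

Reading in chunks keeps only one piece of the file in memory at a time, which can help when a file is too large to load in one call.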