<img src = https://i.imgur.com/37LETOR.png>

**CSV (comma-separated value)** files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or automation engineer. In this video, we’ll go over what CSV files are, how to read CSV files into Pandas DataFrames, and the pros and cons of using CSV files.

This tutorial uses a file called cereal.csv that contains nutrition data on 80 cereals. Download it beforehand from [here](https://github.com/codingwithshawnyt/codingwithshawn/tree/main/Data%20Science%20and%20Machine%20Learning%20(Python)%20Series/How%20to%20Read%20Data%20Files%20Using%20PANDAS).

<h3>The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.</h3>

Data is stored on your computer in individual “files”, or containers, each with a different name. Each file contains data of a different type: the internals of a Jupyter Notebook (.ipynb) are quite different from the internals of an MS Word document.

Computers determine how to read files using the “file extension”, that is, the code that follows the dot (“.”) in the filename.


So, a filename is typically in the form “random-name.file-extension”. Examples:

* <b>project1.DOCX</b> - a Word document called project1
* <b>CodingWithShawn.txt</b> - a text file named CodingWithShawn
* <b>Linux_Tux.jpg</b> - an image file of Linux's mascot, Tux

Other well-known file types and extensions include XLSX (Excel), PDF (Portable Document Format), PNG (images), ZIP (compressed files), GIF (animation), MPEG (video), and MP3 (music).
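
As a small illustration of the naming pattern above, Python's standard `pathlib` module can split a filename into its name and extension. This is only a sketch; the filenames are the examples from the list above, and the variable names are arbitrary.

```python
from pathlib import Path

# Example filenames taken from the list above
filenames = ["project1.DOCX", "CodingWithShawn.txt", "Linux_Tux.jpg"]

# Path.suffix returns the extension, including the leading dot.
# Lower-casing makes ".DOCX" and ".docx" compare as equal.
extensions = {name: Path(name).suffix.lower() for name in filenames}

for name, ext in extensions.items():
    print(f"{name} -> {ext}")
```
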

A CSV file is a file with a “.csv” file extension, e.g. “data.csv”, “super_information.csv”. The “CSV” in this case lets the computer know that the data contained in the file is in “comma separated value” format, which we’ll discuss below.
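
To make the format concrete, here is a minimal sketch of what the text inside a CSV file looks like, parsed with Python's built-in `csv` module. The rows are made-up sample data (not the contents of cereal.csv), and the column names are assumptions chosen for illustration.

```python
import csv
import io

# Made-up sample data: the first line holds the column names,
# and each later line is one record with values separated by commas.
csv_text = """name,calories,rating
Example Flakes,110,35.8
Sample Crunch,120,42.1
"""

# io.StringIO lets us treat the string as if it were an open file
rows = list(csv.reader(io.StringIO(csv_text)))

header, records = rows[0], rows[1:]
print(header)
for record in records:
    print(record)
```

Note that every value comes back as a plain string: the file itself carries no information about which columns are numbers.
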


File extensions are hidden by default on a lot of operating systems. One of the first things any self-respecting engineer, software engineer, or data scientist does on a new computer is make sure that file extensions are shown in Explorer (Windows) or Finder (Mac).

<figure>
  <img src="https://i.imgur.com/11wqIme.png" alt="A picture of where extensions are located in Windows' File Explorer" style="width:100%">
  <figcaption>A picture of where extensions are located in Windows' File Explorer</figcaption>
</figure>

In [None]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
 
# Read data from the file 'cereal.csv' 
# (located in the directory your Python process is running from)
# Delimiters, rows, and column names can be controlled with read_csv parameters (see later) 
df = pd.read_csv("cereal.csv") 
 
# Preview the first 5 lines of the loaded data 
df.head()

<h3>Possible Parsing Errors</h3>

<ol>
<li><code data-enlighter-language="python">FileNotFoundError:&nbsp;File b'filename.csv' does not exist</code><br>A File Not Found error is typically caused by an incorrect path, an unexpected current working directory, or a mistyped file name (a hidden file extension can play a part here!).</li>
<li><code data-enlighter-language="python">UnicodeDecodeError:&nbsp;'utf-8' codec can't decode byte  in position : invalid continuation byte</code><br>A Unicode Decode Error typically occurs when the file contains characters that are not valid in the encoding pandas assumes (UTF-8 by default) and no other encoding was specified. For a quick fix, try opening the file in <a href="https://www.sublimetext.com">Sublime Text</a> and re-saving it with the encoding set to “UTF-8”.</li>
<li><code data-enlighter-language="python">pandas.parser.CParserError: Error tokenizing data.</code><br>Parse errors can occur when the data format is unusual, for example when rows have inconsistent numbers of fields. Try adding the parameter <code>engine="python"</code> to the read_csv function call; this switches the data reading function internally to a slower but more stable method. (Recent pandas versions report this error as <code>pandas.errors.ParserError</code>.)</li>
</ol>
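
The first two errors can be reproduced and handled in a few lines. The sketch below is illustrative only: the filenames are hypothetical, and the Latin-1 file is written on the fly with made-up contents so that the decoding problem can be shown alongside the `encoding` parameter of `read_csv`.

```python
import os
import tempfile

import pandas as pd

# 1. FileNotFoundError: the path does not point to an existing file.
try:
    pd.read_csv("this_file_does_not_exist.csv")  # hypothetical filename
    missing_file_error = None
except FileNotFoundError as err:
    missing_file_error = err
    print("File not found. Current working directory:", os.getcwd())

# 2. UnicodeDecodeError: the file is not UTF-8 encoded.
# Write a small made-up file in the Latin-1 encoding to reproduce the problem.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "latin1_example.csv")
    with open(path, "w", encoding="latin-1") as f:
        f.write("name,city\nRené,Orléans\n")

    try:
        pd.read_csv(path)  # pandas assumes UTF-8 by default
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

    # Passing the file's actual encoding lets pandas decode it correctly
    df_latin1 = pd.read_csv(path, encoding="latin-1")

print(df_latin1)
```

Specifying the encoding in code avoids having to re-save the file by hand, provided you know (or can find out) which encoding the file was written in.
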

<h3>Advanced Concepts</h3>

<p>There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:</p>

**Specifying Data Types:**
<p>CSV files do not contain any type information for data. Pandas infers data types from the values it reads, which can lead to errors. To manually specify the data types for different columns, the&nbsp;<strong>dtype</strong> parameter can be used with a dictionary of column names and data types to be applied (here <code>np</code> refers to NumPy), for example:
    <span class="enlighter"><span class="enlighter-text">dtype=</span><span class="enlighter-g1">{</span><span class="enlighter-s0">"name"</span><span class="enlighter-text">: str, </span><span class="enlighter-s0">"age"</span><span class="enlighter-text">: np.int32</span><span class="enlighter-g1">}</span></span></p>

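<p>Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the <strong>parse_dates</strong>, <strong>date_parser</strong>, <strong>dayfirst</strong>, and <strong>keep_date_col</strong> parameters. Some of these, such as <strong>date_parser</strong> and <strong>keep_date_col</strong>, are deprecated in recent pandas releases, so check the documentation for the version you have installed.</p>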
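
A minimal sketch of `dtype` and `parse_dates` in action is shown here. The data is made up and read from an in-memory string rather than a file; the column names are assumptions chosen to match the `dtype` example above.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample data for illustration
csv_text = """name,age,birth_date
Ana,34,1990-04-12
Ben,28,1996-11-03
"""

df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32},  # explicit type for each listed column
    parse_dates=["birth_date"],            # parse this column as dates
)

# Inspect the resulting column types
print(df_typed.dtypes)
```
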

**Skipping and Picking Rows and Columns From File:**
<p>The&nbsp;<strong>nrows</strong> parameter specifies how many rows to read from the top of the CSV file, which is useful for sampling a large file without loading it completely. The&nbsp;<b>skiprows&nbsp;</b>parameter allows you to specify rows to leave out, either at the start of the file (provide an int) or throughout the file (provide a list of row indices). Finally, the&nbsp;<strong>usecols&nbsp;</strong>parameter can be used to specify which columns in the data to load.</p>
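
The three parameters can be compared side by side on a tiny in-memory example. This is a sketch with made-up data; note that the row indices passed to `skiprows` count lines from the top of the file, so index 0 is the header line.

```python
import io

import pandas as pd

# Made-up sample data for illustration
csv_text = """name,age,city
Ana,34,Lisbon
Ben,28,Oslo
Cho,45,Seoul
Dee,19,Lima
"""

# Read only the first two data rows
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Skip the lines at index 1 and 3 (the "Ana" and "Cho" rows); line 0 is the header
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# Load only the "name" and "city" columns
two_columns = pd.read_csv(io.StringIO(csv_text), usecols=["name", "city"])

print(first_two)
print(skipped)
print(two_columns)
```
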

**Custom Missing Value Symbols:**
<p>When data is exported to CSV from different systems, missing values can be specified with different tokens. The&nbsp;<strong>na_values</strong> parameter allows you to customise the strings that are recognised as missing values. The default values interpreted as NA/NaN include:&nbsp;‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.</p>
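
As a quick sketch of the difference this makes, the made-up data below uses `??` and `.` as missing-value markers. Without `na_values` they are read as ordinary text; with it, they become NaN and the column can be treated as numeric.

```python
import io

import pandas as pd

# Made-up sample data where "??" and "." stand for missing scores
csv_text = """name,score
Ana,12
Ben,??
Cho,.
"""

# Default behaviour: "??" and "." are kept as ordinary strings
default_read = pd.read_csv(io.StringIO(csv_text))

# Custom missing-value tokens: both markers are converted to NaN
custom_read = pd.read_csv(io.StringIO(csv_text), na_values=["??", "."])

print(default_read["score"].tolist())
print(custom_read["score"].tolist())
```
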

In [None]:
# Advanced CSV loading example
'''
data = pd.read_csv(
    "data/files/complex_data_example.tsv",      # relative path to a file in a subdirectory
    sep='\t',           # Tab-separated value file.
    quotechar="'",        # single quote allowed as quote character
    dtype={"salary": int},             # Parse the salary column as an integer 
    usecols=['name', 'birth_date', 'salary'],   # Only load the three columns specified.
    parse_dates=['birth_date'],     # Interpret the birth_date column as a date
    skiprows=10,         # Skip the first 10 lines of the file
    na_values=['.', '??']       # Take any '.' or '??' values as NA
)
'''
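
Because the example above points at a file you may not have, here is a runnable sketch of the same idea using made-up tab-separated data held in a string. The column names and values are assumptions for illustration, and `skiprows` and `na_values` are left out to keep the sample short.

```python
import io

import pandas as pd

# Made-up tab-separated data; "\t" is the tab character between columns
tsv_text = (
    "name\tbirth_date\tsalary\tnotes\n"
    "'Ana Silva'\t1990-04-12\t52000\t'part time'\n"
    "'Ben Olsen'\t1985-11-03\t61000\t'full time'\n"
)

data = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",                                  # tab-separated values
    quotechar="'",                             # single quotes wrap text fields
    dtype={"salary": int},                     # parse salary as an integer
    usecols=["name", "birth_date", "salary"],  # ignore the "notes" column
    parse_dates=["birth_date"],                # interpret birth_date as a date
)

print(data)
print(data.dtypes)
```
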

<h3>CSV Format Advantages and Disadvantages</h3>

As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will encounter as you load, store, and exchange data in CSV format:



Pros:
<ul><li>CSV format is universal and the data can be loaded by almost any software.</li><li>CSV files are simple to understand and debug with a basic text editor.</li><li>CSV files are quick to create and load into memory before analysis.</li></ul>

Cons:
<ul><li>There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.</li><li>There’s no formatting or layout information storable: things like fonts, borders, and column width settings from Microsoft Excel will be lost.</li><li>File encodings can become a problem if there are non-ASCII characters in text fields.</li><li>CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using <a href="https://en.wikipedia.org/wiki/Zip_(file_format)">zip compression</a>.</li></ul>
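
The point about compression can be checked with a short sketch. The DataFrame below is made-up, deliberately repetitive data, and the filenames are hypothetical; pandas can write and read zip-compressed CSV files directly through the `compression` parameter.

```python
import os
import tempfile

import pandas as pd

# Made-up, repetitive data for illustration
df_demo = pd.DataFrame({
    "id": range(1000),
    "category": ["alpha", "beta", "gamma", "delta"] * 250,
    "value": [1.5, 2.5, 3.5, 4.5] * 250,
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")  # hypothetical filenames
    zip_path = os.path.join(tmp_dir, "demo.zip")

    # Write the same data as plain CSV and as zip-compressed CSV
    df_demo.to_csv(csv_path, index=False)
    df_demo.to_csv(zip_path, index=False, compression="zip")

    csv_size = os.path.getsize(csv_path)
    zip_size = os.path.getsize(zip_path)

    # pandas can read the compressed file back without unzipping it first
    restored = pd.read_csv(zip_path, compression="zip")

print(f"Plain CSV: {csv_size} bytes, zipped CSV: {zip_size} bytes")
```

How much space is saved depends heavily on the data; repetitive columns like these compress far better than random values would.
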

<h4>Final Thoughts:</h4>
<p>Although CSV is one of the most common formats for storing data, there are other file types that the modern-day data scientist must be familiar with. You now have a good sense of how useful <code>pandas</code> is when importing CSV files, and conveniently, <code>pandas</code> offers other similar and equally handy functions to import Excel, SAS, and Stata files, to name a few.</p>