# Plain text and binary files

What's the difference between .csv and .xlsx and why should you care?

A brief introduction, with thanks to the Practical Data Science book: https://www.practicaldatascience.org/notebooks/class_3/week_3/03_plaintext_files.html


## Things can look different from how they seem

1. The SAME data can be stored in DIFFERENT kinds of files
2. The SAME file can look DIFFERENT depending on how you look at it

All computer files are actually binary files, series of 0s and 1s. An encoding tells computer programs how to interpret those 0s and 1s in a useful way.

A plain text file is one whose 0s and 1s are meant to be read as characters, so any text editor or program can display it as text. The most common encodings are ASCII and the Unicode encodings such as UTF-8 (see https://www.geeksforgeeks.org/ascii-vs-unicode/).
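To make the idea of an encoding concrete, here is a small sketch that turns a short piece of text into its numbers and bits and back again. The sample strings are made up for illustration.

```python
# A short piece of text, chosen just for illustration
text = "Hi!"

# Encode the characters into bytes using ASCII
raw = text.encode("ascii")

# Each byte is a number, and each number is eight 0s and 1s
numbers = list(raw)
bits = [format(b, "08b") for b in raw]
print(numbers)  # [72, 105, 33]
print(bits)     # ['01001000', '01101001', '00100001']

# Decoding with the same encoding gives the text back
decoded = raw.decode("ascii")
print(decoded)  # Hi!

# Characters outside ASCII need a Unicode encoding such as UTF-8,
# and may take more than one byte each
accented = "é".encode("utf-8")
print(len(accented))  # 2
```

The same bytes only become readable text if the program reading them uses a matching encoding.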

On your computer, you probably have a text editor such as Notepad (Windows) or TextEdit (Mac), which can open plain text files.

## Plain text files

Might look like this:

```
Alice
Bob
Charmaine
Deepak
```

Or this:

```
|\---/|
| o_o |
 \_^_/
```

That's an example of "ASCII art", from https://www.asciiart.eu/animals/cats

Plain text files contain *characters*, like A, b, T, -, /, ^, etc. Lines are separated by line breaks, which are themselves stored as an invisible character, written `\n` in code.
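A quick way to see the invisible line breaks is to compare `print()` with `repr()`. This sketch reuses the list of names from above as a single string.

```python
# The four names from above, stored as one string with line breaks
names = "Alice\nBob\nCharmaine\nDeepak\n"

# print() shows the text the way a text editor would
print(names)

# repr() shows the line breaks as the characters \n
print(repr(names))

# splitlines() cuts the text at each line break
lines = names.splitlines()
print(lines)  # ['Alice', 'Bob', 'Charmaine', 'Deepak']
```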

Python code is written in plain text!

In this course there are two main plain text formats we'll care about for storing data:

- Comma Separated Values (CSVs): plaintext files that use the file suffix .csv. In these files, each row of text represents one row in the data, and columns are separated by commas.
- Tab Separated Values (TSVs): plaintext files that usually use the file suffix .txt or, less commonly, .tsv. In these files, each row of text represents one row in the data, and columns are separated by tabs (the special character denoting an indentation).

What a .csv looks like:

```
Country,Region,Population
UK,"Europe",67596281
Brazil,"S. America",203080756
Tuvalu,"Oceania",11900
```

What the same data look like in .tsv:
```
Country\tRegion\tPopulation
UK\t"Europe"\t67596281
Brazil\t"S. America"\t203080756
Tuvalu\t"Oceania"\t11900
```

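Here, `\t` means "tab" as in "tab-separated". In the actual file each `\t` is a single invisible tab character. Similarly, the end of each line is encoded with `\n`. The backslash `\` signals that what follows is a special character rather than an ordinary letter.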

In all cases, we have to tell Python or any other program to open the data in the right way.

In [None]:
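# Adjacent string literals are joined into one string;
# print() then shows each \t as a tab and each \n as a line break
print('Country\tRegion\tPopulation\n'
      'UK\t"Europe"\t67596281\n'
      'Brazil\t"S. America"\t203080756\n'
      'Tuvalu\t"Oceania"\t11900\n')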
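As a sketch of what "opening the data in the right way" means, the example below reads the same country data as .csv and as .tsv using the `csv` module from Python's standard library. The variable names are just for illustration. pandas `read_csv` takes a `sep` argument that plays the same role as `delimiter` here.

```python
import csv
import io

# The same data as above, once comma-separated and once tab-separated
csv_text = ('Country,Region,Population\n'
            'UK,"Europe",67596281\n'
            'Brazil,"S. America",203080756\n'
            'Tuvalu,"Oceania",11900\n')
tsv_text = csv_text.replace(',', '\t')

# io.StringIO lets us treat a string as if it were an open file
rows_csv = list(csv.reader(io.StringIO(csv_text)))                  # comma is the default
rows_tsv = list(csv.reader(io.StringIO(tsv_text), delimiter='\t'))  # tell the reader about tabs

print(rows_csv[1])           # ['UK', 'Europe', '67596281']
print(rows_csv == rows_tsv)  # True

# Reading the .tsv text as if it were .csv does not fail,
# but every row ends up as one long column
rows_wrong = list(csv.reader(io.StringIO(tsv_text)))
print(len(rows_wrong[0]))    # 1
```

Note that the wrong separator gives no error message, only a wrongly shaped table, which is why it is worth looking at the data after reading it.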

## What's a binary file?

Everything else! .xlsx, .docx, .jpeg, etc.

Binary files tend to store data more efficiently: they can hold more information in less disk space, or carry lots of formatting information. But they can also be less portable between different programs.
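A small sketch of the disk space point, using the population of Brazil from the table above. Storing the number as text takes one byte per digit, while a fixed-size binary integer takes four bytes. The `struct` module is part of Python's standard library, and the format `"<I"` (a 4-byte unsigned integer) is just one possible choice, picked here for illustration.

```python
import struct

population = 203080756

# As plain text: one ASCII byte per digit
as_text = str(population).encode("ascii")
print(as_text)       # b'203080756'
print(len(as_text))  # 9

# As binary: a 4-byte unsigned integer
as_binary = struct.pack("<I", population)
print(len(as_binary))  # 4

# The binary bytes are not readable as text, but a program that
# knows the format can unpack them back into the same number
restored = struct.unpack("<I", as_binary)[0]
print(restored == population)  # True
```

The text version can be read by any program or person; the binary version is smaller but only makes sense to a program that knows how it was packed.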

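For example, an .xlsx Excel spreadsheet is a zipped bundle of XML files. Unzipped, the worksheet part that holds the data above looks like this (cells marked `t="s"` hold only an index into a separate list of text values):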

```
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac xr xr2 xr3" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3="http://schemas.microsoft.com/office/spreadsheetml/2016/revision3" xr:uid="{37925CF0-9FB6-DC4E-81C0-CCD26F80CB69}"><dimension ref="A1:C4"/><sheetViews><sheetView tabSelected="1" workbookViewId="0"><selection activeCell="A2" sqref="A2"/></sheetView></sheetViews><sheetFormatPr baseColWidth="10" defaultRowHeight="16" x14ac:dyDescent="0.2"/><sheetData><row r="1" spans="1:3" x14ac:dyDescent="0.2"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c><c r="C1" t="s"><v>2</v></c></row><row r="2" spans="1:3" x14ac:dyDescent="0.2"><c r="A2" t="s"><v>8</v></c><c r="B2" t="s"><v>3</v></c><c r="C2"><v>67596281</v></c></row><row r="3" spans="1:3" x14ac:dyDescent="0.2"><c r="A3" t="s"><v>4</v></c><c r="B3" t="s"><v>5</v></c><c r="C3"><v>203080756</v></c></row><row r="4" spans="1:3" x14ac:dyDescent="0.2"><c r="A4" t="s"><v>6</v></c><c r="B4" t="s"><v>7</v></c><c r="C4"><v>11900</v></c></row></sheetData><pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/></worksheet>
```

## The bottom line

Check the format of your data! Check that it has been read into your analysis as you intended!

## An example: data frame to different filetypes

In [None]:
# Import pandas library
import pandas as pd

In [None]:
# initialize list of lists
data = [['alice', 10], ['bob', 15], ['charmaine', 15], ['deepak', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# print dataframe.
print(df)

In [None]:
# with no file path, to_csv() returns the .csv text as a string, shown here in the notebook
df.to_csv()

In [None]:
# the same, but in .tsv format using a tab separator
df.to_csv(sep='\t')

In [None]:
# write to file
df.to_csv(path_or_buf='kids_ages.txt', sep='\t')
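To check that the file came out as intended, one option is to look at it twice: once as raw text and once read back into pandas. This is a sketch that rebuilds the same data frame and writes it to a temporary folder (an assumption made so the example is self-contained; in the notebook you could read 'kids_ages.txt' directly).

```python
import os
import tempfile

import pandas as pd

# Rebuild the data frame from above
data = [['alice', 10], ['bob', 15], ['charmaine', 15], ['deepak', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'kids_ages.txt')

    # Write a tab-separated file, as in the cell above
    df.to_csv(path, sep='\t')

    # Look at the file as plain text: the tabs and line breaks are visible in repr()
    with open(path) as f:
        raw_text = f.read()
    print(repr(raw_text))

    # Read it back, telling pandas about the tab separator and that
    # the first column is the row index that to_csv() wrote out
    df_back = pd.read_csv(path, sep='\t', index_col=0)

print(df_back)
```

Leaving out `index_col=0` would still work, but the row numbers would come back as an extra unnamed column.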

## Exercises to know more

Read the 'kids_ages.txt' file with different programs.
Try some other datasets.

Find the help pages for pandas `to_excel` and `read_excel`, and check that you can get data in and out of Excel formats too.