# Loading data from tsv files using `pandas`

## What is a `tsv` file?

`tsv` stands for Tab-Separated Values. A `tsv` file is a plain-text file in which each line is one record and the fields within a line are separated by tab characters. The format is commonly used to move raw data into and out of spreadsheet software, and some databases (e.g., PostgreSQL) use it as an export format. Because `tsv` files are plain text, the raw data can also be viewed in any text editor.

### Table of Contents

1. [Reading the `tsv` file with `pandas`](#json_pandas)
1. [Reading a `tsv` file with `csv` (optional)](#file_read)
1. [Exploring the file contents with `csv` (optional)](#file_explore)


----

### About the Data 

**Located:** `/dsa/data/all_datasets/people_db.tsv`

This is a file containing data on various people that includes the following information per column.

Attribute | Description
----------|------------
`NAME` | Person's Name
`COMPANY` | Company name
`POSITION_TITLE` | Position
`ADDRESS` | Street Address
`CITY` | City
`STATE` | State
`ZIP` | Zip Code
`PHONE_NUMBER` | Phone #
`EMAIL` | Email Address


<a id='json_pandas'></a>
## Reading the `tsv` file with `pandas`

Python's built-in `csv` library (covered in the optional sections below) isn't the only library capable of handling the `tsv` format. `pandas` can also handle `tsv`, and it reads the data into an easy-to-read data frame object. 
Not only is a data frame more visually appealing for humans, it also provides flexible data manipulation capabilities, the ability to store columns of differing data types (e.g., `string`, `int`), and intuitive indexing. 
Reading in the file is as simple as calling the `read_csv()` function with the file path as the first argument and `sep='\t'` as a keyword argument.

In [None]:
import pandas as pd

df = pd.read_csv('/dsa/data/all_datasets/people_db.tsv',  sep='\t')

df.head()

We now have a data frame in which the first line of the file supplies the column headers and each subsequent record becomes a row.

Everything you have learned, or will learn, about `pandas` data frames can now be applied to this data.
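As a small illustration of how `sep='\t'` behaves, the sketch below reads a few tab-separated lines from an in-memory string instead of the real file. The `sample` text and the `df_sample` name are made up for this example (the values are borrowed from the sample records), and passing `dtype={"ZIP": str}` is an optional choice here, assuming you want zip codes kept as text rather than converted to numbers.

```python
import io

import pandas as pd

# Illustrative in-memory data, not the real people_db.tsv file
sample = (
    "NAME\tCOMPANY\tSTATE\tZIP\n"
    "Eleanore Larson\tConnelly, Medhurst and Berge\tOhio\t24916-1917\n"
    "Jackson Herman\tHyatt, Swaniawski and Roob\tWashington\t81109\n"
)

# sep="\t" splits on tabs only, so commas inside a field are left alone.
# dtype={"ZIP": str} asks pandas to keep the ZIP column as text.
df_sample = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"ZIP": str})

print(df_sample.shape)              # (rows, columns)
print(list(df_sample.columns))      # header taken from the first line
print(df_sample.loc[0, "COMPANY"])  # the commas are part of the value
```

Because the separator is a tab, a company name such as `Connelly, Medhurst and Berge` stays in a single column.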

<a id='file_read'></a>
## Reading a `tsv` file with `csv` (optional)

The Python `csv` library can read `tsv` files when you pass `delimiter="\t"`. The `\t` is the escape sequence for the **tab** character, which we will use to split the data points of our `people_db.tsv` file. 

In [None]:
import csv

with open('/dsa/data/all_datasets/people_db.tsv') as f:
    file = csv.reader(f, delimiter="\t")
    for row in file:
        # You can print the file column by column
        # The print command here prints the first three columns with tab delimiting
        #print(row[0]+"\t \t"+row[1]+"\t \t"+row[2])
        
        #This command prints the entire row
        print(row)

Notice how we used the `with open(...) as` syntax. 
It opens our file, runs the indented block, and then closes the file automatically when the block ends. 
In this case, we open the **'/dsa/data/all_datasets/people_db.tsv'** file, wrap it in a `csv.reader()` object that we assign to the variable `file`, and then print out the contents by iterating through all the rows. The reader parses the file one row at a time as we iterate, rather than loading the whole file into memory at once. 
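To see the automatic closing in action, here is a minimal sketch that uses a small temporary file as a stand-in for the real data. The temporary file, its two lines of content, and the names `path` and `rows` are assumptions made only for this illustration.

```python
import csv
import os
import tempfile

# Write a tiny stand-in tsv file (illustrative content, not the real data)
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, newline="") as tmp:
    tmp.write("NAME\tSTATE\nJada Considine\tPennsylvania\n")
    path = tmp.name

with open(path, newline="") as f:
    # list(...) pulls every row out of the reader while the file is still open
    rows = list(csv.reader(f, delimiter="\t"))
    print(f.closed)  # False: we are still inside the with block

print(f.closed)      # True: the file was closed when the block ended
print(rows)          # the rows remain available because we stored them in a list

os.remove(path)      # clean up the temporary file
```

Storing the rows in a list is one way to keep the data around after the file has been closed; the reader itself can no longer be iterated at that point.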

At the moment, it looks a little messy, but let's take a closer look at its contents.

<a id='file_explore'></a>
## Exploring the file contents with `csv` (optional)

In the previous code block, we read the contents of the **people_db.tsv** file through a reader assigned to the variable `file`. 
Before we work with individual fields, we should see what we are working with. 
Let's take a look at the data inside the file.

**people_db.tsv**

Note: Some spaces have been added to the end of each line below so that the data renders properly as markdown.

NAME    COMPANY POSITION_TITLE  ADDRESS CITY    STATE   ZIP     PHONE_NUMBER    EMAIL  
Eleanore Larson Connelly, Medhurst and Berge    'National Data Administrator'   11013 Kian Circle       Ledafort        Ohio    24916-1917      965-324-2199    eleanore@example.org  
Jada Considine  Ryan-Goyette    'Legacy Security Developer'     882 Clara Ferry Grahamport      Pennsylvania    57914-3699      1-185-804-8474 x49870   jada@example.com  
Jackson Herman  Hyatt, Swaniawski and Roob      'Dynamic Mobility Producer'     83650 Hilpert Burgs     Kertzmannberg   Washington      81109   909.070.4467    jackson@example.org  
Roxanne Orn     Paucek, Nikolaus and Watsica    'Principal Intranet Assistant'  278 Satterfield Wells   West Erin       Rhode Island    61798-2573      1-986-220-7139  roxanne@example.org  
Carroll Crona   Gottlieb-Fadel  'National Infrastructure Specialist'    421 Reilly Ridges       New Asia        Arizona 27675   (980) 737-2140 x47295   carroll@example.org  

One of the things to take note of with a tsv file is that each data point in each line is separated by the tab character. This is the defining characteristic of a tsv file.
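A quick way to see the tab separation is to split a single line by hand. The `line` below is an illustrative string with real tab characters, built from values in the first sample record; this is only a sketch of the idea, and for real files the `csv` module remains the better tool.

```python
# One illustrative record with real tab characters between the fields
line = "Eleanore Larson\tConnelly, Medhurst and Berge\tOhio"

# Splitting on the tab character recovers the individual data points
fields = line.split("\t")

print(fields)
print(len(fields))  # 3 fields, even though the company name contains a comma
```

Since the fields are separated by tabs, the comma in the company name does not start a new field.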

Unlike other file formats, such as JSON (upcoming labs), where data is natively stored in key-value pairs, `tsv` files have an implied dictionary structure. 
The top row usually holds the field names, and each subsequent row contains a single data record. 
Let's take a look at how we can print the data as name-value pairs using `csv.DictReader`. 

You will notice that the resulting output resembles a JSON file. 


In [None]:
import csv
from itertools import islice

# Open the file in a with block so that it is closed automatically
with open("/dsa/data/all_datasets/people_db.tsv", newline="") as f:
    # DictReader uses the first row as the keys for every record
    input_file = csv.DictReader(f, delimiter="\t")
    # I could simply print all the rows, but there are a lot of them,
    # so islice limits the loop to the first five rows.
    for row in islice(input_file, 5):
        print(row)
    

In the output above, each new record is enclosed in curly brackets `{ }`. 
We can see that in each record, data are stored in key-value pairs where the key is the attribute and the value is the assigned value for that record. 
Keep in mind that we can access an attribute's value by referencing the key of that particular record. 
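As a minimal sketch of looking up a value by key, the example below reads two records from an in-memory string rather than the real file. The `sample` text and the names `records` and `first_record` are assumptions made for this illustration.

```python
import csv
import io

# Two illustrative records held in memory (values borrowed from the sample data)
sample = (
    "NAME\tSTATE\tEMAIL\n"
    "Eleanore Larson\tOhio\teleanore@example.org\n"
    "Jada Considine\tPennsylvania\tjada@example.com\n"
)

# Each record becomes a dictionary keyed by the header row
records = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
first_record = records[0]

# Reference a key to get that attribute's value for the record
print(first_record["STATE"])
print(first_record["EMAIL"])
```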
Now imagine that we wanted to collect the values of the `POSITION_TITLE` attribute across all of the records.

In [None]:
import csv
from collections import defaultdict

#Create a list of field values:
columns = defaultdict(list) # each value in each column is appended to a list
with open('/dsa/data/all_datasets/people_db.tsv') as f:
    reader = csv.DictReader(f, delimiter="\t") # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value 
            columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k

#Print values for one column. 
print(columns['POSITION_TITLE'])


In the above example, we iterate over every field of every row to populate the lists stored in the dictionary. 
Once the iteration is done, we can show all the values for any column, for example `columns['STATE']`.
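One detail worth knowing about this pattern is how a `defaultdict` reacts to a column name that does not exist. The sketch below repeats the column-building loop on a small in-memory sample; the `sample` text, the `sample_columns` name, and the misspelled key `"metro"` are all made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Illustrative in-memory data, not the real people_db.tsv file
sample = (
    "NAME\tSTATE\n"
    "Eleanore Larson\tOhio\n"
    "Jada Considine\tPennsylvania\n"
    "Jackson Herman\tWashington\n"
)

sample_columns = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    for key, value in row.items():
        sample_columns[key].append(value)  # one list per column name

print(sample_columns["STATE"])    # all the values for the STATE column

# A key that was never a column does not raise an error with defaultdict:
print("metro" in sample_columns)  # False: no such column was read
print(sample_columns["metro"])    # [] is returned, and the key is created
print("metro" in sample_columns)  # True: the lookup itself added the key
```

An empty list from this kind of lookup is therefore a hint to double-check the column name against the header row.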

# Save your notebook, then `File > Close and Halt`