# Video: An Opinionated Take

This video argues for using the list-of-dictionaries representation when first investigating a new file or data source.

Script:
* This may be the most opinionated video of this module.
* The question I'd like to pontificate on now is, what data structure should be used when you first read a file?
* To be clear, I am talking about reading in a text-based file with a header, and rows and columns of data that would look totally at home in a spreadsheet.
* But other than that, you don't know much about what is in the file.
* In that case, in my not-so-humble opinion, you should read it into a list of dictionaries.
* Why?
* You can do this in just a few lines of code using the csv module.

In [None]:
!wget https://raw.githubusercontent.com/bu-cds-omds/dx601-examples/main/data/mango-tiny.tsv

--2024-08-13 13:30:30--  https://raw.githubusercontent.com/bu-cds-omds/dx601-examples/main/data/mango-tiny.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151 [text/plain]
Saving to: ‘mango-tiny.tsv’


2024-08-13 13:30:31 (1000 KB/s) - ‘mango-tiny.tsv’ saved [151/151]



In [None]:
!head mango-tiny.tsv

green_rating	yellow_rating	softness	wrinkles	estimated_flavor	estimated_sweetness	rated_flavor
1	5	4	0	4	4	5
1	5	5	1	5	5	1
2	4	3	1	3	3	3
3	3	2	0	2	1	2


In [None]:
import csv

In [None]:
with open("mango-tiny.tsv") as mango_fp:
    reader = csv.DictReader(mango_fp, dialect="excel-tab")
    data = list(reader)

Script:
* You've already seen those lines of code earlier this week.
* And they'll run in a second unless your file is huge.
* Then you can look at one row trivially, and see what kind of data is in the file.


In [None]:
data[0]

{'green_rating': '1',
 'yellow_rating': '5',
 'softness': '4',
 'wrinkles': '0',
 'estimated_flavor': '4',
 'estimated_sweetness': '4',
 'rated_flavor': '5'}

Script:
* The dictionary form gives you the columns front and center.
* Those are the keys.
* And right next to those keys, you can see the values.
* And while those values will still be strings at this point, you can probably guess which types to parse them into once you have looked at them.
* Now that you have looked at a row, you can check whether it matches your expectations about the file.
* Are all the columns that you expected in there?
* You can randomly sample the rows, and check for common weird values in the data.



In [None]:
import random

In [None]:
random.choice(data)

{'green_rating': '1',
 'yellow_rating': '5',
 'softness': '4',
 'wrinkles': '0',
 'estimated_flavor': '4',
 'estimated_sweetness': '4',
 'rated_flavor': '5'}
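
The checks described above (are the expected columns present, and are there odd values?) might look like the following minimal sketch. It inlines the contents of mango-tiny.tsv so that it runs on its own. The `expected_columns` set is a hypothetical expectation chosen for illustration, not something the source of the file specifies.

```python
import csv
import io
from collections import Counter

# Inline copy of the mango-tiny.tsv contents shown earlier, so no file is needed.
SAMPLE_TSV = (
    "green_rating\tyellow_rating\tsoftness\twrinkles\t"
    "estimated_flavor\testimated_sweetness\trated_flavor\n"
    "1\t5\t4\t0\t4\t4\t5\n"
    "1\t5\t5\t1\t5\t5\t1\n"
    "2\t4\t3\t1\t3\t3\t3\n"
    "3\t3\t2\t0\t2\t1\t2\n"
)

data = list(csv.DictReader(io.StringIO(SAMPLE_TSV), dialect="excel-tab"))

# Hypothetical expectation: columns we believe the file should contain.
expected_columns = {"green_rating", "yellow_rating", "softness", "rated_flavor"}
missing_columns = expected_columns - set(data[0].keys())

# Tally the raw string values in one column; blanks or typos would stand out here.
wrinkle_counts = Counter(row["wrinkles"] for row in data)

print("missing columns:", missing_columns)
print("wrinkles values:", wrinkle_counts)
```

An empty `missing_columns` set means every expected column was found. On a larger file, the counter is a quick way to spot unexpected values in a column that should only hold a few distinct ones.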

Script:
* If relevant, you can do some quick summary statistics using this data structure.


In [None]:
sum(float(row['rated_flavor']) for row in data) / len(data)

2.75
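
Once a sample row suggests what the types should be, a conversion pass is a natural next step before more summary statistics. This is a minimal sketch, assuming every column in this file holds an integer. That is true for the four rows shown, but a blank or non-numeric cell would raise a `ValueError`. The names `typed_rows` and `summary` are illustrative.

```python
import csv
import io
import statistics

# Inline copy of the mango-tiny.tsv contents shown earlier, so no file is needed.
SAMPLE_TSV = (
    "green_rating\tyellow_rating\tsoftness\twrinkles\t"
    "estimated_flavor\testimated_sweetness\trated_flavor\n"
    "1\t5\t4\t0\t4\t4\t5\n"
    "1\t5\t5\t1\t5\t5\t1\n"
    "2\t4\t3\t1\t3\t3\t3\n"
    "3\t3\t2\t0\t2\t1\t2\n"
)

data = list(csv.DictReader(io.StringIO(SAMPLE_TSV), dialect="excel-tab"))

# Assumption: every value is an integer; int() raises ValueError otherwise.
typed_rows = [
    {column: int(value) for column, value in row.items()}
    for row in data
]

# Summarize one column using the standard library's statistics module.
flavors = [row["rated_flavor"] for row in typed_rows]
summary = {
    "mean": statistics.mean(flavors),
    "median": statistics.median(flavors),
    "min": min(flavors),
    "max": max(flavors),
}
print(summary)
```

The mean agrees with the 2.75 computed above. The list of dictionaries stays in the same shape after conversion, so everything that worked on the string version still works.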

Script:
* After you've taken a look at the data, you may have a better idea what to do with it.
* You may have identified something that needs cleanup in the data.
* Some use cases prefer a specific format, like a list of tuples to load into a database.
* You may want to use a NumPy array for efficiency, or a pandas DataFrame to get both the NumPy efficiency and dictionary convenience.
* Both of those will come later in this course.
* At that point, the pandas library will take over this role from the csv module, for essentially the same reasons.
* You can read a file very easily, and see sample data quickly.
* Occasionally, if you receive a broken file and pandas has trouble parsing it, you will want to switch back to the csv module to debug the problem, since the csv module does less processing.
* With either module, it is important to look at a sample of the data before you write much code using it.
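
As a sketch of that kind of debugging, the csv module's plain `reader` can report which lines have the wrong number of fields. The broken input below is made up for illustration and is not taken from the mango data.

```python
import csv
import io

# Hypothetical broken input: line 3 is missing a field and line 4 has an extra one.
BROKEN_TSV = (
    "green_rating\tsoftness\trated_flavor\n"
    "1\t4\t5\n"
    "1\t5\n"
    "2\t3\t3\t9\n"
)

reader = csv.reader(io.StringIO(BROKEN_TSV), dialect="excel-tab")
header = next(reader)

# Collect (line number, row) for every row whose length differs from the header's.
bad_rows = [
    (reader.line_num, row)
    for row in reader
    if len(row) != len(header)
]

for line_number, row in bad_rows:
    print(f"line {line_number}: expected {len(header)} fields, got {len(row)}: {row}")
```

Because `csv.reader` returns each row as a plain list of strings, a ragged row shows up as it appears in the file, which makes the offending line easy to find.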