# Working with CSV

CSV (comma-separated values): a comma or semicolon **separates/delimits** the values in a row, and a newline character separates rows. \
TSV (tab-separated values): a horizontal tab character **separates/delimits** the values in a row, and a newline character separates rows.

Each row corresponds to a dictionary (or, later on, an object). \
CSV/TSV is flat: values cannot be nested.

CSV is very strong at delivering data of the same entity type at scale.
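
As a minimal sketch of the row-to-dictionary idea, the snippet below parses the same two records written once as CSV and once as TSV. The sample text and the names `csv_text` and `tsv_text` are made up for illustration and are not part of the datasets used later.

```python
import csv
import io

# Hypothetical sample data: the same two records, comma- and tab-delimited.
csv_text = "Name,Age\nAlex,41\nBert,42\n"
tsv_text = "Name\tAge\nAlex\t41\nBert\t42\n"

# io.StringIO lets the csv module read a string as if it were a file.
csv_rows = list(csv.DictReader(io.StringIO(csv_text), delimiter=","))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Each row becomes a flat dictionary keyed by the header line.
print(csv_rows[0])
# Only the delimiter differs, so both formats yield the same rows.
print(csv_rows == tsv_rows)
```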


In [7]:
filename = "biostats.csv"

with open(filename) as reader:
    content = reader.readlines()

print(content) # note the horizontal tabs

['"Name"\t"Sex"\t"Age"\t"Height (in)"\t"Weight (lbs)"\n', '"Alex"\t"M"\t41\t74\t170\n', '"Bert"\t"M"\t42\t68\t166\n', '"Carl"\t"M"\t32\t70\t155\n', '"Dave"\t"M"\t39\t72\t167\n', '"Elly"\t"F"\t30\t66\t124\n', '"Fran"\t"F"\t33\t66\t115\n', '"Gwen"\t"F"\t26\t64\t121\n', '"Hank"\t"M"\t30\t71\t158\n', '"Ivan"\t"M"\t53\t72\t175\n', '"Jake"\t"M"\t32\t69\t143\n', '"Kate"\t"F"\t47\t69\t139\n', '"Luke"\t"M"\t34\t72\t163\n', '"Myra"\t"F"\t23\t62\t98\n', '"Neil"\t"M"\t36\t75\t160\n', '"Omar"\t"M"\t38\t70\t145\n', '"Page"\t"F"\t31\t67\t135\n', '"Quin"\t"M"\t29\t71\t176\n', '"Ruth"\t"F"\t28\t65\t131\n']


## Using the `csv` library
The standard-library `csv` module is fine for small datasets.

In [13]:
import csv

# delimiter can be ',' or ';' or '\t'; this file uses tabs
persons = csv.DictReader(content, delimiter="\t")
# get column headers
persons.fieldnames

['Name', 'Sex', 'Age', 'Height (in)', 'Weight (lbs)']
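
The cell above feeds `DictReader` a list of lines that was already read into memory. As a hedged alternative, the reader can also consume the open file directly, so the lines are parsed as they are read. The sketch below writes a tiny stand-in file (`sample.tsv`, a hypothetical name) to a temporary directory so that it runs on its own; with the real data you would open `biostats.csv` instead.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for biostats.csv: quoted, tab-separated values.
sample = '"Name"\t"Sex"\t"Age"\n"Alex"\t"M"\t41\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv")
    with open(path, "w", newline="") as writer:
        writer.write(sample)

    # The csv documentation recommends newline="" when opening files for it.
    with open(path, newline="") as reader:
        persons_from_file = csv.DictReader(reader, delimiter="\t")
        headers = persons_from_file.fieldnames
        # Consume the reader while the file is still open.
        rows = list(persons_from_file)

print(headers)
print(rows)
```

Note that the surrounding double quotes are removed by the parser, while the numeric-looking values still come back as strings.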

In [14]:
# If an object has an __iter__ method, it is iterable and we can loop over it with a for-loop
persons.__iter__

<bound method DictReader.__iter__ of <csv.DictReader object at 0x105935280>>
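
To make the role of `__iter__` a little more concrete, here is a small sketch of what a for-loop does behind the scenes, using the built-in `iter()` and `next()` functions. The `lines` list is hypothetical sample data. It also shows that a reader can be consumed only once.

```python
import csv

# Hypothetical sample data in the same tab-separated shape as above.
lines = ["Name\tAge\n", "Alex\t41\n", "Bert\t42\n"]
reader = csv.DictReader(lines, delimiter="\t")

iterator = iter(reader)      # calls reader.__iter__()
first = next(iterator)       # calls iterator.__next__() and returns the first data row
remaining = list(iterator)   # consumes every row that is left

# The reader is exhausted now, so iterating again yields nothing.
leftover = list(reader)

print(first)
print(remaining)
print(leftover)
```

If the rows are needed more than once, either store them in a list or create a new `DictReader`.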

In [18]:
persons = csv.DictReader(content, delimiter="\t")
for person in persons:
    print(person)

{'Name': 'Alex', 'Sex': 'M', 'Age': '41', 'Height (in)': '74', 'Weight (lbs)': '170'}
{'Name': 'Bert', 'Sex': 'M', 'Age': '42', 'Height (in)': '68', 'Weight (lbs)': '166'}
{'Name': 'Carl', 'Sex': 'M', 'Age': '32', 'Height (in)': '70', 'Weight (lbs)': '155'}
{'Name': 'Dave', 'Sex': 'M', 'Age': '39', 'Height (in)': '72', 'Weight (lbs)': '167'}
{'Name': 'Elly', 'Sex': 'F', 'Age': '30', 'Height (in)': '66', 'Weight (lbs)': '124'}
{'Name': 'Fran', 'Sex': 'F', 'Age': '33', 'Height (in)': '66', 'Weight (lbs)': '115'}
{'Name': 'Gwen', 'Sex': 'F', 'Age': '26', 'Height (in)': '64', 'Weight (lbs)': '121'}
{'Name': 'Hank', 'Sex': 'M', 'Age': '30', 'Height (in)': '71', 'Weight (lbs)': '158'}
{'Name': 'Ivan', 'Sex': 'M', 'Age': '53', 'Height (in)': '72', 'Weight (lbs)': '175'}
{'Name': 'Jake', 'Sex': 'M', 'Age': '32', 'Height (in)': '69', 'Weight (lbs)': '143'}
{'Name': 'Kate', 'Sex': 'F', 'Age': '47', 'Height (in)': '69', 'Weight (lbs)': '139'}
{'Name': 'Luke', 'Sex': 'M', 'Age': '34', 'Height (in)
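
Every value in the dictionaries above is a string, including the numbers. A minimal sketch of converting selected columns by hand follows; the shortened `lines` sample and the choice of columns to convert are assumptions made for illustration.

```python
import csv

# Hypothetical excerpt in the same format as biostats.csv.
lines = [
    '"Name"\t"Age"\t"Height (in)"\n',
    '"Alex"\t41\t74\n',
    '"Bert"\t42\t68\n',
]

persons = []
for row in csv.DictReader(lines, delimiter="\t"):
    # DictReader does not infer types, so numeric columns are converted explicitly.
    row["Age"] = int(row["Age"])
    row["Height (in)"] = int(row["Height (in)"])
    persons.append(row)

# With real integers, arithmetic works as expected.
average_age = sum(person["Age"] for person in persons) / len(persons)
print(average_age)
```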

## Using `pandas` library

This method is good for larger datasets.

In [19]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [20]:
import pandas as pd

sales = pd.read_csv("1000000 Sales Records.csv")
sales

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Sub-Saharan Africa,South Africa,Fruits,Offline,M,7/27/2012,443368995,7/28/2012,1593,9.33,6.92,14862.69,11023.56,3839.13
1,Middle East and North Africa,Morocco,Clothes,Online,M,9/14/2013,667593514,10/19/2013,4611,109.28,35.84,503890.08,165258.24,338631.84
2,Australia and Oceania,Papua New Guinea,Meat,Offline,M,5/15/2015,940995585,6/4/2015,360,421.89,364.69,151880.40,131288.40,20592.00
3,Sub-Saharan Africa,Djibouti,Clothes,Offline,H,5/17/2017,880811536,7/2/2017,562,109.28,35.84,61415.36,20142.08,41273.28
4,Europe,Slovakia,Beverages,Offline,L,10/26/2016,174590194,12/4/2016,3973,47.45,31.79,188518.85,126301.67,62217.18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,Sub-Saharan Africa,Senegal,Baby Food,Offline,L,11/6/2010,575470578,12/11/2010,3387,255.28,159.42,864633.36,539955.54,324677.82
999996,Central America and the Caribbean,Panama,Office Supplies,Offline,C,1/12/2015,766942107,3/1/2015,4068,651.21,524.96,2649122.28,2135537.28,513585.00
999997,Europe,Norway,Office Supplies,Online,M,10/25/2011,685472047,12/5/2011,5266,651.21,524.96,3429271.86,2764439.36,664832.50
999998,Europe,Montenegro,Beverages,Offline,M,10/31/2010,946734225,12/8/2010,8551,47.45,31.79,405744.95,271836.29,133908.66


In [21]:
# summarize the dataset: column names, non-null counts, dtypes and memory usage
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 14 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   Region          1000000 non-null  object 
 1   Country         1000000 non-null  object 
 2   Item Type       1000000 non-null  object 
 3   Sales Channel   1000000 non-null  object 
 4   Order Priority  1000000 non-null  object 
 5   Order Date      1000000 non-null  object 
 6   Order ID        1000000 non-null  int64  
 7   Ship Date       1000000 non-null  object 
 8   Units Sold      1000000 non-null  int64  
 9   Unit Price      1000000 non-null  float64
 10  Unit Cost       1000000 non-null  float64
 11  Total Revenue   1000000 non-null  float64
 12  Total Cost      1000000 non-null  float64
 13  Total Profit    1000000 non-null  float64
dtypes: float64(5), int64(2), object(7)
memory usage: 106.8+ MB
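
In the summary above, `Order Date` and `Ship Date` are listed as `object`, which means they were loaded as plain strings. As a sketch (not necessarily how the original notebook continues), they could be converted with `pd.to_datetime`. The two-row `sample_csv` and the derived `Days To Ship` column are hypothetical, and the `%m/%d/%Y` format is an assumption based on the values shown earlier.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt with the same date format as the sales data.
sample_csv = io.StringIO(
    "Order ID,Order Date,Ship Date,Units Sold\n"
    "443368995,7/27/2012,7/28/2012,1593\n"
    "667593514,9/14/2013,10/19/2013,4611\n"
)
sample = pd.read_csv(sample_csv)

# Convert the string columns to datetimes (month/day/year is assumed).
for column in ["Order Date", "Ship Date"]:
    sample[column] = pd.to_datetime(sample[column], format="%m/%d/%Y")

# Datetime columns support arithmetic, e.g. shipping delay in days.
sample["Days To Ship"] = (sample["Ship Date"] - sample["Order Date"]).dt.days

print(sample.dtypes)
print(sample["Days To Ship"].tolist())
```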


In [25]:
# get values of a column
countries = sales["Country"]
countries

0             South Africa
1                  Morocco
2         Papua New Guinea
3                 Djibouti
4                 Slovakia
                ...       
999995             Senegal
999996              Panama
999997              Norway
999998          Montenegro
999999           Nicaragua
Name: Country, Length: 1000000, dtype: object

In [26]:
# get distinct values of a column
distinct_countries = sales["Country"].drop_duplicates()
distinct_countries

0               South Africa
1                    Morocco
2           Papua New Guinea
3                   Djibouti
4                   Slovakia
               ...          
704                 Honduras
708    Republic of the Congo
761                    Spain
804               San Marino
836                     Fiji
Name: Country, Length: 185, dtype: object
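
`drop_duplicates()` keeps the index label of each first occurrence, which is why the index above jumps around. A few related Series methods are sketched below on a small hypothetical `countries` Series; which one fits best depends on whether you need the values, their number, or their frequencies.

```python
import pandas as pd

# Hypothetical miniature version of the Country column.
countries = pd.Series(
    ["Vietnam", "Norway", "Vietnam", "Fiji", "Norway", "Vietnam"],
    name="Country",
)

distinct = countries.drop_duplicates()  # Series, keeps the index of each first occurrence
distinct_values = countries.unique()    # array of distinct values in order of appearance
how_many = countries.nunique()          # number of distinct values
counts = countries.value_counts()       # occurrences of each value

print(distinct)
print(list(distinct_values))
print(how_many)
print(counts)
```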

In [29]:
# see if a value occurs in a column by filtering the column with a condition
countries = sales["Country"]
countries_named_vietnam = countries[countries == "Vietnam"] # boolean mask: keep entries equal to "Vietnam"
countries_named_vietnam

14        Vietnam
652       Vietnam
654       Vietnam
765       Vietnam
934       Vietnam
           ...   
999202    Vietnam
999245    Vietnam
999260    Vietnam
999307    Vietnam
999539    Vietnam
Name: Country, Length: 5367, dtype: object
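
The filter above returns the matching entries. If only a yes/no answer or a count is needed, the boolean mask itself can be reduced, as in this sketch on a hypothetical four-element Series. One pitfall worth knowing: the `in` operator on a Series tests the index labels, not the values.

```python
import pandas as pd

# Hypothetical miniature version of the Country column.
countries = pd.Series(["Vietnam", "Norway", "Vietnam", "Fiji"], name="Country")

mask = countries == "Vietnam"       # boolean Series, True where the value matches
has_vietnam = bool(mask.any())      # is the value present at all?
vietnam_count = int(mask.sum())     # True counts as 1, so the sum is the number of matches

# isin() tests membership against several candidate values at once.
in_either = countries.isin(["Vietnam", "Fiji"])

# `in` looks at the index (0, 1, 2, 3 here), so this is False.
in_index = "Vietnam" in countries
# Checking the underlying values gives the intended answer.
in_values = "Vietnam" in countries.values

print(has_vietnam, vietnam_count, in_index, in_values)
```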

In [30]:
# get rows with a column matching a value
# e.g. get all rows having country = Vietnam
sales[sales["Country"] == "Vietnam"]

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
14,Asia,Vietnam,Personal Care,Online,M,4/4/2010,314505374,5/6/2010,7984,81.73,56.67,652532.32,452453.28,200079.04
652,Asia,Vietnam,Office Supplies,Online,C,12/14/2015,558181273,12/26/2015,7098,651.21,524.96,4622288.58,3726166.08,896122.50
654,Asia,Vietnam,Baby Food,Online,L,1/24/2017,193490970,2/11/2017,7133,255.28,159.42,1820912.24,1137142.86,683769.38
765,Asia,Vietnam,Personal Care,Offline,H,3/28/2011,357008302,5/2/2011,5545,81.73,56.67,453192.85,314235.15,138957.70
934,Asia,Vietnam,Clothes,Offline,M,3/21/2014,248213183,4/9/2014,652,109.28,35.84,71250.56,23367.68,47882.88
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999202,Asia,Vietnam,Clothes,Online,C,5/18/2011,128185832,6/15/2011,1891,109.28,35.84,206648.48,67773.44,138875.04
999245,Asia,Vietnam,Cosmetics,Offline,M,8/29/2016,230315339,9/2/2016,4282,437.20,263.33,1872090.40,1127579.06,744511.34
999260,Asia,Vietnam,Beverages,Offline,C,2/19/2015,912964141,4/3/2015,5232,47.45,31.79,248258.40,166325.28,81933.12
999307,Asia,Vietnam,Office Supplies,Offline,C,3/2/2013,712284696,3/3/2013,9725,651.21,524.96,6333017.25,5105236.00,1227781.25
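
Row filters like the one above can be combined. The following sketch assumes a small hypothetical `sales_sample` frame with three of the real column names; conditions are joined with `&` (and) or `|` (or), and each comparison needs its own parentheses because of operator precedence.

```python
import pandas as pd

# Hypothetical miniature version of the sales data.
sales_sample = pd.DataFrame(
    {
        "Country": ["Vietnam", "Vietnam", "Norway", "Vietnam"],
        "Sales Channel": ["Online", "Offline", "Online", "Online"],
        "Units Sold": [7984, 5545, 5266, 652],
    }
)

is_vietnam = sales_sample["Country"] == "Vietnam"
is_online = sales_sample["Sales Channel"] == "Online"

# Rows that satisfy both conditions.
vietnam_online = sales_sample[is_vietnam & is_online]

# .loc takes a row condition and a list of columns to keep.
big_orders = sales_sample.loc[
    is_vietnam & (sales_sample["Units Sold"] > 1000),
    ["Country", "Units Sold"],
]

print(vietnam_online)
print(big_orders)
```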


In [31]:
# get the first n rows (here n = 15)
sales.head(15)

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Sub-Saharan Africa,South Africa,Fruits,Offline,M,7/27/2012,443368995,7/28/2012,1593,9.33,6.92,14862.69,11023.56,3839.13
1,Middle East and North Africa,Morocco,Clothes,Online,M,9/14/2013,667593514,10/19/2013,4611,109.28,35.84,503890.08,165258.24,338631.84
2,Australia and Oceania,Papua New Guinea,Meat,Offline,M,5/15/2015,940995585,6/4/2015,360,421.89,364.69,151880.4,131288.4,20592.0
3,Sub-Saharan Africa,Djibouti,Clothes,Offline,H,5/17/2017,880811536,7/2/2017,562,109.28,35.84,61415.36,20142.08,41273.28
4,Europe,Slovakia,Beverages,Offline,L,10/26/2016,174590194,12/4/2016,3973,47.45,31.79,188518.85,126301.67,62217.18
5,Asia,Sri Lanka,Fruits,Online,L,11/7/2011,830192887,12/18/2011,1379,9.33,6.92,12866.07,9542.68,3323.39
6,Sub-Saharan Africa,Seychelles,Beverages,Online,M,1/18/2013,425793445,2/16/2013,597,47.45,31.79,28327.65,18978.63,9349.02
7,Sub-Saharan Africa,Tanzania,Beverages,Online,L,11/30/2016,659878194,1/16/2017,1476,47.45,31.79,70036.2,46922.04,23114.16
8,Sub-Saharan Africa,Ghana,Office Supplies,Online,L,3/23/2017,601245963,4/15/2017,896,651.21,524.96,583484.16,470364.16,113120.0
9,Sub-Saharan Africa,Tanzania,Cosmetics,Offline,L,5/23/2016,739008080,5/24/2016,7768,437.2,263.33,3396169.6,2045547.44,1350622.16


In [32]:
# get the last n rows (here n = 15)
sales.tail(15)

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
999985,Asia,North Korea,Office Supplies,Online,M,4/12/2014,624163186,4/27/2014,2612,651.21,524.96,1700960.52,1371195.52,329765.0
999986,Sub-Saharan Africa,Republic of the Congo,Cosmetics,Offline,H,2/4/2013,788361155,2/9/2013,2440,437.2,263.33,1066768.0,642525.2,424242.8
999987,Asia,Bhutan,Fruits,Online,H,3/12/2012,640066754,4/15/2012,8831,9.33,6.92,82393.23,61110.52,21282.71
999988,Central America and the Caribbean,Trinidad and Tobago,Office Supplies,Offline,L,5/9/2014,541828811,5/22/2014,8041,651.21,524.96,5236379.61,4221203.36,1015176.25
999989,Europe,Austria,Baby Food,Online,M,10/28/2015,476508653,11/20/2015,8354,255.28,159.42,2132609.12,1331794.68,800814.44
999990,Sub-Saharan Africa,Zimbabwe,Meat,Online,C,2/23/2010,497280108,3/23/2010,5090,421.89,364.69,2147420.1,1856272.1,291148.0
999991,Europe,Netherlands,Fruits,Offline,C,8/28/2014,277629506,9/5/2014,5595,9.33,6.92,52201.35,38717.4,13483.95
999992,Middle East and North Africa,Iran,Personal Care,Online,H,8/27/2013,461355674,9/28/2013,4251,81.73,56.67,347434.23,240904.17,106530.06
999993,Europe,Armenia,Personal Care,Offline,C,3/28/2015,412569940,4/26/2015,7468,81.73,56.67,610359.64,423211.56,187148.08
999994,Europe,Germany,Office Supplies,Online,C,3/16/2017,465696132,4/4/2017,8689,651.21,524.96,5658363.69,4561377.44,1096986.25


In [35]:
# convert a dataset to a list of dictionaries (same structure as DictReader rows,
# but pandas keeps the inferred types such as int and float instead of strings)
first_15_rows = sales.head(15)
first_15_rows.to_dict(orient="records")


[{'Region': 'Sub-Saharan Africa',
  'Country': 'South Africa',
  'Item Type': 'Fruits',
  'Sales Channel': 'Offline',
  'Order Priority': 'M',
  'Order Date': '7/27/2012',
  'Order ID': 443368995,
  'Ship Date': '7/28/2012',
  'Units Sold': 1593,
  'Unit Price': 9.33,
  'Unit Cost': 6.92,
  'Total Revenue': 14862.69,
  'Total Cost': 11023.56,
  'Total Profit': 3839.13},
 {'Region': 'Middle East and North Africa',
  'Country': 'Morocco',
  'Item Type': 'Clothes',
  'Sales Channel': 'Online',
  'Order Priority': 'M',
  'Order Date': '9/14/2013',
  'Order ID': 667593514,
  'Ship Date': '10/19/2013',
  'Units Sold': 4611,
  'Unit Price': 109.28,
  'Unit Cost': 35.84,
  'Total Revenue': 503890.08,
  'Total Cost': 165258.24,
  'Total Profit': 338631.84},
 {'Region': 'Australia and Oceania',
  'Country': 'Papua New Guinea',
  'Item Type': 'Meat',
  'Sales Channel': 'Offline',
  'Order Priority': 'M',
  'Order Date': '5/15/2015',
  'Order ID': 940995585,
  'Ship Date': '6/4/2015',
  'Units Sol
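
The records produced by pandas have the same list-of-dictionaries structure as the `DictReader` rows, but the value types differ. A small side-by-side sketch on hypothetical inline data (`text`) makes the difference visible:

```python
import csv
import io

import pandas as pd

# Hypothetical two-row excerpt with a text, an integer and a float column.
text = "Country,Units Sold,Unit Price\nSouth Africa,1593,9.33\nMorocco,4611,109.28\n"

# csv.DictReader: every value stays a string.
csv_records = list(csv.DictReader(io.StringIO(text)))

# pandas: column types are inferred while reading, then carried into the dicts.
pandas_records = pd.read_csv(io.StringIO(text)).to_dict(orient="records")

print(csv_records[0])
print(pandas_records[0])
```

With `csv`, numeric columns have to be converted manually before doing arithmetic; with pandas they are ready to use.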

Homework:
* Download [Iris dataset CSV](https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv)
* Read the CSV using both `csv` and `pandas`
* From each output, create a list of dicts containing only the rows of the `Versicolor` flower type (`variety = 'Versicolor'`)