# Read a CSV file

The one with all the commas.  Except when they're using tabs or pipes (|).

And a CSV isn't always a nice set of headers followed by a set of data rows.
* Sometimes they have more stuff at the top and bottom of the file.
* Sometimes they have data in columns that don't have headers.
* Sometimes strings don't have "" around them.

etc. etc. There are many ways to mess up a CSV file, and CrowdTangle tries a whole bunch of them.
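
For the tabs-or-pipes case, here is a minimal sketch of telling the `csv` library which separator to expect. The helper name `read_delimited` and the sample strings are made up for illustration; they are not from the CrowdTangle file.

```python
import csv
import io


def read_delimited(text, delimiter=','):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    # newline='' lets the csv module deal with line endings itself
    buffer = io.StringIO(text, newline='')
    return list(csv.reader(buffer, delimiter=delimiter))


# Hypothetical samples: the same two rows, once with pipes and once with tabs
pipe_rows = read_delimited('Source|Followers\nExample Page|10\n', delimiter='|')
tab_rows = read_delimited('Source\tFollowers\nExample Page\t10\n', delimiter='\t')
```

`pd.read_csv` takes the same choice through its `sep` argument.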

In [1]:
# Load in the libraries you need.  Point at the input file

import csv
import pandas as pd

infileroot = 'data/crowdtangle_extension/naturalnews.com_2021-10-19_18est'
infile = infileroot + '.csv'

In [22]:
# Use the csv library to read the file, printing each row's length and contents
rawdata = []
with open(infile, newline='') as f:
    reader = csv.reader(f)
    for count, row in enumerate(reader):
        print('row {}: {} {}'.format(count, len(row), row))
        rawdata.append(row)

row 0: 1 ['Link: https://www.naturalnews.com/']
row 1: 4 ['Facebook Interactions', 'Facebook Reactions', 'Facebook Shares', 'Facebook Comments']
row 2: 4 ['0', '0', '0', '0']
row 3: 0 []
row 4: 6 ['Source', 'Followers', 'Date', 'Interactions', 'Post Type', 'Link']
row 5: 6 ['Fox News Sean Hannity', '12240', 'Mon Dec 02 2019 16:24:14 GMT+0000', '50453', 'Facebook', 'https://www.facebook.com/groups/1879196809009182/permalink/2446101075652083']
row 6: 6 ['IUBIM ORTODOXIA', '153020', 'Tue Jan 29 2019 18:45:56 GMT+0000', '1446', 'Facebook', 'https://www.facebook.com/769866573059648/posts/2113547865358172']
row 7: 6 ['José Luis Camacho Espina', '0', 'Thu Sep 28 2017 18:36:09 GMT+0000', '1302', 'Facebook', 'https://www.facebook.com/335909243281118/posts/647968232075216']
row 8: 6 ['Dr. Christiane Northrup', '0', 'Thu May 30 2013 02:47:02 GMT+0000', '1193', 'Facebook', 'https://www.facebook.com/118912795028/posts/10152857815890029']
row 9: 6 ['Vaccine Awareness South Africa - VASA', '6890', 'W

In [17]:
# At this point we can force the data into a pandas DataFrame
# (rows with fewer entries than the longest row are padded with empty cells)
pd.DataFrame(rawdata)

Unnamed: 0,0,1,2,3,4,5,6,7
0,Link: https://www.naturalnews.com/,,,,,,,
1,Facebook Interactions,Facebook Reactions,Facebook Shares,Facebook Comments,,,,
2,0,0,0,0,,,,
3,,,,,,,,
4,Source,Followers,Date,Interactions,Post Type,Link,,
...,...,...,...,...,...,...,...,...
389,Info that make's you think,0,Tue Sep 06 2011 13:20:57 GMT+0000,19,Facebook,https://www.facebook.com/136412389704486/posts...,,
390,AFRICAN MEDITATION GROUP,0,Thu May 16 2019 22:04:55 GMT+0000,19,Facebook,https://www.facebook.com/groups/35002210871338...,,
391,Holistic Heights,0,Tue Aug 30 2011 15:54:15 GMT+0000,19,Facebook,https://www.facebook.com/175805415778107/posts...,,
392,Oplysning om Aspartam,6988,Sun Oct 10 2021 12:37:52 GMT+0000,4,Facebook,https://www.facebook.com/607222032673858/posts...,,
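
Once the rows are in a plain list, the stuff at the top can be separated from the table by looking for the header row. This is a rough sketch, assuming the header is exactly the six names printed as row 4 above; the function name `split_table` is made up, and the sample is a shortened copy of the printed rows.

```python
def split_table(rows, header):
    """Return (preamble, table_rows): everything before the header, and everything after it."""
    start = rows.index(header)  # raises ValueError if the header row is missing
    return rows[:start], rows[start + 1:]


# Shortened sample built from the rows printed earlier
header = ['Source', 'Followers', 'Date', 'Interactions', 'Post Type', 'Link']
sample = [
    ['Link: https://www.naturalnews.com/'],
    ['Facebook Interactions', 'Facebook Reactions', 'Facebook Shares', 'Facebook Comments'],
    ['0', '0', '0', '0'],
    [],
    header,
    ['IUBIM ORTODOXIA', '153020', 'Tue Jan 29 2019 18:45:56 GMT+0000', '1446', 'Facebook',
     'https://www.facebook.com/769866573059648/posts/2113547865358172'],
]

preamble, table = split_table(sample, header)
```

`pd.DataFrame(table, columns=header)` would then give named columns, but only if every row in `table` has exactly six entries.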


In [6]:
# THIS WILL FAIL... because...
# read_csv thinks the first row in a csv file is the column headings
# it also thinks that no row after that should have more entries
# than the number of headings
# And our file has 4 rows of information before the table starts
pd.read_csv(infile)

ParserError: Error tokenizing data. C error: Expected 4 fields in line 5, saw 6
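
One way to see this kind of mismatch coming is to count how many entries each row has before handing the file to pandas. A minimal sketch, with a made-up helper name `field_counts` and a sample copied from the first five rows printed above:

```python
import csv
import io
from collections import Counter


def field_counts(text):
    """Count how many rows have each number of fields."""
    reader = csv.reader(io.StringIO(text, newline=''))
    return Counter(len(row) for row in reader)


# The first five lines of the file, as printed by the csv reader earlier
sample = (
    'Link: https://www.naturalnews.com/\n'
    'Facebook Interactions,Facebook Reactions,Facebook Shares,Facebook Comments\n'
    '0,0,0,0\n'
    '\n'
    'Source,Followers,Date,Interactions,Post Type,Link\n'
)

counts = field_counts(sample)
```

A file that is one clean table should give a single row length; several different lengths suggest extra material or stray separators.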


In [18]:
# So we skip the first 4 rows, and pandas correctly reads the column headings
# But this will also fail, because
# CrowdTangle doesn't put "" around strings.
# Which means that every comma inside a string starts a new column
# So there are more entries in those rows than column headings
pd.read_csv(infile, skiprows=4)

ParserError: Error tokenizing data. C error: Expected 6 fields in line 264, saw 7
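
A possible repair, sketched under a big assumption: that the stray commas only ever appear in the first column (Source), so the last five entries of a row are always Followers, Date, Interactions, Post Type and Link. The notebook does not confirm this, and the helper name `merge_extra_fields` and the sample row are invented.

```python
def merge_extra_fields(row, n_columns=6):
    """Fold surplus leading fields back into the first column.

    ASSUMPTION: any extra commas belong to the first column's text.
    """
    extra = len(row) - n_columns
    if extra <= 0:
        return row  # nothing to repair (short rows are left alone)
    # Rejoin the pieces of the first column with the comma the parser split on
    return [','.join(row[:extra + 1])] + row[extra + 1:]


# Hypothetical broken row: 'Example Page, Inc.' was split into two fields
broken = ['Example Page', ' Inc.', '10', 'Mon Jan 01 2018 00:00:00 GMT+0000',
          '5', 'Facebook', 'https://example.com/post']
fixed = merge_extra_fields(broken)
```

If commas can turn up in other columns too, this would quietly put text in the wrong place, so check a few repaired rows by eye.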


In [33]:
# This is what's supposed to happen.
# The first 258 data rows don't have commas in the text,
# so reading only those rows works
pd.read_csv(infile, skiprows=4, nrows=258)

Unnamed: 0,Source,Followers,Date,Interactions,Post Type,Link
0,Fox News Sean Hannity,12240,Mon Dec 02 2019 16:24:14 GMT+0000,50453,Facebook,https://www.facebook.com/groups/18791968090091...
1,IUBIM ORTODOXIA,153020,Tue Jan 29 2019 18:45:56 GMT+0000,1446,Facebook,https://www.facebook.com/769866573059648/posts...
2,José Luis Camacho Espina,0,Thu Sep 28 2017 18:36:09 GMT+0000,1302,Facebook,https://www.facebook.com/335909243281118/posts...
3,Dr. Christiane Northrup,0,Thu May 30 2013 02:47:02 GMT+0000,1193,Facebook,https://www.facebook.com/118912795028/posts/10...
4,Vaccine Awareness South Africa - VASA,6890,Wed Nov 06 2019 12:31:24 GMT+0000,1162,Facebook,https://www.facebook.com/398819000228205/posts...
...,...,...,...,...,...,...
253,AWAKE LIKE ME,2993,Wed Oct 09 2019 22:39:56 GMT+0000,33,Facebook,https://www.facebook.com/groups/91889017153571...
254,GMOLOL,0,Wed Dec 23 2015 03:02:24 GMT+0000,33,Facebook,https://www.facebook.com/groups/24581465222518...
255,Foodrising,0,Tue Dec 18 2018 01:06:37 GMT+0000,32,Facebook,https://www.facebook.com/609497599105238/posts...
256,Gegenwind Deutschland,1872,Sun Nov 17 2019 21:40:17 GMT+0000,32,Facebook,https://www.facebook.com/groups/43299174009569...
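
Rather than reading a number like 258 off the error message, the count of clean rows can be worked out from the parsed rows. A small sketch; the function name `count_clean_rows` and the toy rows are made up for illustration.

```python
from itertools import takewhile


def count_clean_rows(data_rows, n_columns=6):
    """Count the leading rows that have exactly n_columns entries."""
    return sum(1 for _ in takewhile(lambda row: len(row) == n_columns, data_rows))


# Toy data rows (header already removed): the third row has one entry too many
toy_rows = [
    ['a', '1', 'd1', '10', 'Facebook', 'u1'],
    ['b', '2', 'd2', '20', 'Facebook', 'u2'],
    ['c', ' extra', '3', 'd3', '30', 'Facebook', 'u3'],
    ['d', '4', 'd4', '40', 'Facebook', 'u4'],
]

n_clean = count_clean_rows(toy_rows)
```

The result is the value that could be passed as `nrows`; it stops at the first bad row, so clean rows after it are not counted.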
