# Grabbing files

When we scrape and download documents (or if we receive a cache of documents), we'll need to read and analyze them.

We might want to quantify elements in the unstructured data and transfer them to a spreadsheet. There are different ways to do that, but first a document has to be **read**.


- [Download the sample files](https://drive.google.com/file/d/1wPknfzabti49EKxv9zAFkso_g2HrniQp/view?usp=share_link) we will need and place them in the same folder as your ```.ipynb``` file.

In [1]:
## importing libraries
import glob

In [2]:
## grab a single csv file
glob.glob("demo-docs/fla_count_as_of_2020-08-19_time_12_31_00.csv")

['demo-docs/fla_count_as_of_2020-08-19_time_12_31_00.csv']

In [3]:
## grab only csv files where the filenames start with fl
glob.glob("demo-docs/fl*.csv")
# glob.glob("demo-docs/fl*") also works to get all files regardless of filetype

['demo-docs/fla_count_as_of_2020-08-19_time_11_46_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_16_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_31_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_01_00.csv']
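Notice that the list above is not in chronological order: `glob.glob()` makes no promise about the order of its results. If order matters, wrapping the call in `sorted()` is one option. The sketch below is a minimal illustration that builds its own throwaway folder with empty stand-in files (the folder and the shortened filenames are assumptions, not part of the sample data) so it can run anywhere.

```python
import glob
import os
import tempfile

# Build a throwaway folder with empty stand-in files (hypothetical names
# loosely modeled on the sample data).
demo_dir = tempfile.mkdtemp()
for name in ["fla_count_time_12_16_00.csv",
             "fla_count_time_11_46_00.csv",
             "pa_count_time_11_31_00.csv"]:
    open(os.path.join(demo_dir, name), "w").close()

# sorted() puts the matching paths in alphabetical order
fl_files = sorted(glob.glob(os.path.join(demo_dir, "fl*.csv")))

# os.path.basename() drops the folder part of each path
fl_names = [os.path.basename(path) for path in fl_files]
print(fl_names)
```

With the real sample files, `sorted(glob.glob("demo-docs/fl*.csv"))` should work the same way, because the timestamps in the filenames sort alphabetically.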

In [4]:
## grab all pdf files
glob.glob("demo-docs/*.pdf")

['demo-docs/adolph-coors-2015.pdf',
 'demo-docs/adolph-coors-2014.pdf',
 'demo-docs/adolph-coors-2013.pdf']

In [5]:
## grab all the files!
glob.glob("demo-docs/*")

['demo-docs/fla_count_as_of_2020-08-19_time_11_46_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_16_00.csv',
 'demo-docs/adolph-coors-2015.pdf',
 'demo-docs/adolph-coors-2014.pdf',
 'demo-docs/adolph-coors-2013.pdf',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_31_00.csv',
 'demo-docs/read_sample2.txt',
 'demo-docs/read_sample1.txt',
 'demo-docs/pa_count_as_of_2020-08-19_time_11_31_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_01_00.csv']

In [6]:
# grab text files only
my_text_files = glob.glob("demo-docs/*.txt")
my_text_files

['demo-docs/read_sample2.txt', 'demo-docs/read_sample1.txt']

In [7]:
my_text_files[0]

'demo-docs/read_sample2.txt'

# Reading files

`open()` gives us a file object (`<class '_io.TextIOWrapper'>`). To read its actual contents, we can use the following:

`.read()` reads the entire document

`.read(some_integer)` gives you that number of characters

`.readline()` gives you the next line (the first line, if nothing has been read yet)

`.readline(some_integer)` gives you at most that number of characters from the next line

`.readlines()` gives you all the remaining lines as a list
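One detail worth seeing in action is that a file object remembers its position: each call picks up where the previous one stopped. The sketch below is a minimal illustration that writes its own three-line stand-in file (the path and contents are assumptions, not part of the sample data) and then calls each method in turn.

```python
import os
import tempfile

# Write a small three-line stand-in file (hypothetical content).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("First line\nSecond line\nThird line\n")

with open(sample_path, "r", encoding="utf-8") as f:
    first_five = f.read(5)        # 'First'
    rest_of_line = f.readline()   # ' line\n' (continues after read(5))
    next_three = f.readline(3)    # 'Sec'
    remaining = f.readlines()     # ['ond line\n', 'Third line\n']
    leftover = f.read()           # '' (nothing is left to read)

print(repr(first_five), repr(rest_of_line), repr(next_three))
print(remaining, repr(leftover))
```

This is why the cells in this notebook reopen the file each time: a fresh `open()` starts again from the top.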

In [8]:
with open("demo-docs/read_sample2.txt", "r") as my_text:
    print(type(my_text))

<class '_io.TextIOWrapper'>


In [9]:
## read and print entire file
with open(my_text_files[0], "r") as my_text:
    print(my_text.read())

New Zealand to Reduce Covid Self-Isolation Period to Seven Days
ByTracy Withers

New Zealand will reduce the isolation period for Covid-19 cases and their household contacts to seven days in order to get more people back to work.

The period will reduce from 10 days effective at 11:59 p.m. on Friday March 11 in Wellington, Minister for Covid Response Chris Hipkins said in a statement.



In [10]:
## read and print only the first 50 characters
with open(my_text_files[0], "r") as my_text:
    print(my_text.read(50)) # gives you the first 50 characters

New Zealand to Reduce Covid Self-Isolation Period 


# Saving the file's contents inside a variable

In [11]:
## read and hold the first 50 characters in a variable
with open(my_text_files[1], "r") as my_text:
    text50 = my_text.read(50)

In [12]:
text50

'McDonald’s, Coca-Cola Hit Pause on Russia Amid Ris'

In [13]:
## read the first line into a variable
with open(my_text_files[1], "r") as my_text:
    first_line = my_text.readline()

In [14]:
first_line

'McDonald’s, Coca-Cola Hit Pause on Russia Amid Rising Backlash\n'

In [15]:
## read the whole thing into a variable
with open(my_text_files[1], "r") as my_text:
    all_text = my_text.read()

In [16]:
all_text

'McDonald’s, Coca-Cola Hit Pause on Russia Amid Rising Backlash\nBy Leslie Patton and Brendan Case\n\nMcDonald’s Corp., Coca-Cola Co. and Starbucks Corp. are temporarily halting business operations in Russia amid an intensifying backlash since the invasion of Ukraine started nearly two weeks ago. \n\nThe iconic U.S. brands, seen around the world as the face of U.S. capitalism, announced their decisions in a flurry of announcements on Tuesday afternoon, joining hundreds of other global companies that have halted work in Russia since the war began. PepsiCo Inc. said it would suspend soft drink sales in Russia but would continue to sell daily essentials such as milk and baby formula.\n'

# It's more useful to save the text as a list of lines

In [17]:
## store text file into a list
with open(my_text_files[1], "r") as my_text:
    all_text_list = my_text.readlines()

In [18]:
all_text_list

['McDonald’s, Coca-Cola Hit Pause on Russia Amid Rising Backlash\n',
 'By Leslie Patton and Brendan Case\n',
 '\n',
 'McDonald’s Corp., Coca-Cola Co. and Starbucks Corp. are temporarily halting business operations in Russia amid an intensifying backlash since the invasion of Ukraine started nearly two weeks ago. \n',
 '\n',
 'The iconic U.S. brands, seen around the world as the face of U.S. capitalism, announced their decisions in a flurry of announcements on Tuesday afternoon, joining hundreds of other global companies that have halted work in Russia since the war began. PepsiCo Inc. said it would suspend soft drink sales in Russia but would continue to sell daily essentials such as milk and baby formula.\n']
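Every item returned by `.readlines()` still ends with `\n`, and blank lines come through as `'\n'`. One possible cleanup, sketched here with a short stand-in list (the `raw_lines` values are placeholders, not the sample article), is a list comprehension that strips whitespace and skips empty lines.

```python
# A stand-in with the same shape as a .readlines() result (placeholder text)
raw_lines = [
    "Sample headline\n",
    "By A. Reporter\n",
    "\n",
    "First paragraph. \n",
    "\n",
    "Second paragraph.\n",
]

# .strip() removes leading/trailing whitespace, including the newline;
# the "if" keeps only lines that still contain something after stripping
clean_lines = [line.strip() for line in raw_lines if line.strip()]
print(clean_lines)
```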

# We can then index the list

In [19]:
headline = all_text_list[0]
headline

'McDonald’s, Coca-Cola Hit Pause on Russia Amid Rising Backlash\n'

In [20]:
byline = all_text_list[1]
byline.rstrip()

'By Leslie Patton and Brendan Case'
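Putting the pieces together, one way to move toward a spreadsheet is to loop over every text file that `glob` finds, pull out the headline and byline, and write the rows to a CSV file. The sketch below is a minimal illustration under several assumptions: it creates its own stand-in files (`story_a.txt`, `story_b.txt`), it assumes each file has the headline on line 1 and the byline on line 2, and the output name `headlines.csv` is made up for the example.

```python
import csv
import glob
import os
import tempfile

# Create two stand-in text files (hypothetical names and contents).
demo_dir = tempfile.mkdtemp()
samples = {
    "story_a.txt": "Headline A\nBy Reporter One\n\nBody text A.\n",
    "story_b.txt": "Headline B\nBy Reporter Two\n\nBody text B.\n",
}
for name, content in samples.items():
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
        f.write(content)

# Read each file and keep its first two lines without the trailing newline.
rows = []
for path in sorted(glob.glob(os.path.join(demo_dir, "*.txt"))):
    with open(path, "r", encoding="utf-8") as my_text:
        lines = my_text.readlines()
    rows.append({
        "file": os.path.basename(path),
        "headline": lines[0].rstrip(),
        "byline": lines[1].rstrip(),
    })

# Write one row per file; newline="" is what the csv module expects here.
csv_path = os.path.join(demo_dir, "headlines.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=["file", "headline", "byline"])
    writer.writeheader()
    writer.writerows(rows)

print(rows)
```

With the sample files, replacing the temporary folder with `"demo-docs/*.txt"` should give one row per article, provided each file follows the same headline-then-byline layout.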