# Grabbing files

When we scrape and download documents (or if we receive a cache of documents), we'll need to read and analyze them.

We might want to quantify elements in the unstructured data and transfer them to a spreadsheet. There are different ways to do that, but first a document has to be **read**.


- [Download the sample files](https://drive.google.com/file/d/1wPknfzabti49EKxv9zAFkso_g2HrniQp/view?usp=share_link) we will need and place them in the same folder as your ```.ipynb``` file.

# ```glob```

```glob``` is a module in Python's standard library that collects files matching a Unix shell-style pattern into a list.

## Using a path

We need to store in a variable the path from our ```.ipynb``` file to the folder that holds our files.


In [1]:
## import libraries
import glob

In [7]:
## grab a single csv file
## glob is the module and glob() is a function inside it, hence glob.glob()
## it returns a list
glob.glob("demo-docs/fla_count_as_of_2020-08-19_time_11_31_00.csv")

['demo-docs/fla_count_as_of_2020-08-19_time_11_31_00.csv']
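
The cell above types the path inline. A minimal sketch of the approach this section describes, storing the folder path in a variable and building the pattern from it, might look like this. The temporary folder and the file names `a.csv`, `b.csv` and `notes.txt` are made-up stand-ins so the example runs anywhere; in this notebook you would set `folder = "demo-docs"` instead.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for the demo-docs folder so the sketch runs anywhere
folder = tempfile.mkdtemp()
for name in ["a.csv", "b.csv", "notes.txt"]:
    with open(os.path.join(folder, name), "w") as f:
        f.write("sample\n")

# Build the pattern from the folder variable instead of retyping the path
csv_pattern = os.path.join(folder, "*.csv")
csv_files = glob.glob(csv_pattern)
print(len(csv_files))
```

Keeping the folder in one variable means a later change of location only has to be made in one place.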

### The power of ```glob``` comes from its ability to gather any target files we want.

In [10]:
## grab only the csv files
glob.glob("demo-docs/*.csv")

['demo-docs/fla_count_as_of_2020-08-19_time_11_46_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_16_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_11_31_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_31_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_01_00.csv']

In [11]:
## grab all the pdf files
glob.glob("demo-docs/*.pdf")

['demo-docs/adolph-coors-2015.pdf',
 'demo-docs/adolph-coors-2014.pdf',
 'demo-docs/adolph-coors-2013.pdf']

In [13]:
## match any pdf whose name starts with adolph
glob.glob("demo-docs/adolph*.pdf")

['demo-docs/adolph-coors-2015.pdf',
 'demo-docs/adolph-coors-2014.pdf',
 'demo-docs/adolph-coors-2013.pdf']

In [14]:
## grab all the files! (any name that contains a dot)
glob.glob("demo-docs/*.*")

['demo-docs/fla_count_as_of_2020-08-19_time_11_46_00.csv',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_16_00.csv',
 'demo-docs/adolph-coors-2015.pdf',
 'demo-docs/adolph-coors-2014.pdf',
 'demo-docs/fla_count_as_of_2020-08-19_time_11_31_00.csv',
 'demo-docs/adolph-coors-2013.pdf',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_31_00.csv',
 'demo-docs/read_sample2.txt',
 'demo-docs/read_sample1.txt',
 'demo-docs/fla_count_as_of_2020-08-19_time_12_01_00.csv']

In [15]:
## grab txt files only
glob.glob("demo-docs/*.txt")

['demo-docs/read_sample2.txt', 'demo-docs/read_sample1.txt']
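
Notice that the lists above are not in alphabetical order: `glob.glob()` returns matches in whatever order the file system provides them. If the order matters (for example, processing reports year by year), one option is to wrap the call in `sorted()`. The sketch below uses a temporary folder and invented file names (`report-2013.pdf` and so on) as empty placeholders, purely for illustration.

```python
import glob
import os
import tempfile

# Hypothetical folder with empty placeholder files, created out of order
docs_dir = tempfile.mkdtemp()
for name in ["report-2015.pdf", "report-2013.pdf", "report-2014.pdf"]:
    with open(os.path.join(docs_dir, name), "w") as f:
        f.write("")

# sorted() returns a new, alphabetically ordered list of the matched paths
pdf_files = sorted(glob.glob(os.path.join(docs_dir, "*.pdf")))

# Keep only the file names for display
pdf_names = [os.path.basename(p) for p in pdf_files]
print(pdf_names)
```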

# Reading files


In [20]:
## create a text wrapper object by opening the 'read_sample2.txt' file
## the path is relative to the current working directory
with open("demo-docs/read_sample2.txt", "r") as my_text:  # "r" stands for read
    print(type(my_text))

<class '_io.TextIOWrapper'>


#### We can read the actual contents of this ```<class '_io.TextIOWrapper'>``` object using the following:

- ```.read()``` reads the entire document.
- ```.read(some_integer)``` reads "some_integer" number of characters.
- ```.readline()``` reads the next line (the first line on a freshly opened file). If you pass an integer, it reads at most that many characters of that line.
- ```.readlines()``` reads each line as a separate item and places them into a list.
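
The methods listed above all move a shared position forward through the file, so each call continues where the previous one stopped. A small sketch, assuming a throwaway three-line file (the name `sample.txt` and its contents are invented for this example):

```python
import os
import tempfile

# Hypothetical three-line file written to a temporary folder
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

with open(sample_path, "r") as my_text:
    first_five = my_text.read(5)       # the first 5 characters
    rest_of_line = my_text.readline()  # continues to the end of line 1
    remaining = my_text.readlines()    # every line that is left, as a list
    nothing_left = my_text.read()      # empty string: we are at the end

print(repr(first_five), repr(rest_of_line), remaining, repr(nothing_left))
```

This is why the cells in this notebook reopen the file before each read: a fresh `open()` starts again from the beginning.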



In [23]:
## create a variable that holds our file name
file_name = "demo-docs/read_sample2.txt"

In [25]:
## read and print entire file
## "with open" closes the file automatically when the block ends (a bare open() leaves it open)
with open(file_name, "r") as my_text:
    print(my_text.read())

New Zealand to Reduce Covid Self-Isolation Period to Seven Days
ByTracy Withers

New Zealand will reduce the isolation period for Covid-19 cases and their household contacts to seven days in order to get more people back to work.

The period will reduce from 10 days effective at 11:59 p.m. on Friday March 11 in Wellington, Minister for Covid Response Chris Hipkins said in a statement.



In [26]:
## read and print 50 characters
with open(file_name, "r") as my_text:  # the file closes automatically at the end of the with block
    print(my_text.read(50))

New Zealand to Reduce Covid Self-Isolation Period 


## Saving file to memory
So far, we haven't saved the text. 
The content is only available inside the ```with open``` block.
If we try to read from the file outside that block, we'll get a ```ValueError: I/O operation on closed file.```

In [28]:
## try to read 60 characters outside the with block
## this will break!
print(my_text.read(60))
# my_text still exists, but the file was closed when the with block ended

ValueError: I/O operation on closed file.
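
To see why this happens, we can inspect the file object's `closed` attribute inside and outside the block. The sketch below is self-contained and assumes a throwaway file named `closed_demo.txt` (an invented name); it catches the error instead of letting the cell fail.

```python
import os
import tempfile

# Hypothetical one-line file used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "closed_demo.txt")
with open(demo_path, "w") as f:
    f.write("some text\n")

with open(demo_path, "r") as my_text:
    closed_inside = my_text.closed  # False: the file is open here

closed_outside = my_text.closed     # True: the with block closed it

try:
    my_text.read(60)
except ValueError as err:
    message = str(err)
    print(message)
```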

## We fix that by saving what we read in a variable

In [31]:
## read and hold the first 25 characters in a variable
with open(file_name, "r") as my_text:  # the file closes automatically at the end of the with block
    text_25 = my_text.read(25)

In [32]:
## call the variable above
text_25

'New Zealand to Reduce Cov'

In [41]:
## read the first line into a variable
## readlines(1) returns a one-item list; readline() would return a plain string
with open(file_name, "r") as my_text:  # other modes: "w" writes (overwriting), "a" appends without overwriting
    first_line = my_text.readlines(1)

In [42]:
## call the variable above
first_line

['New Zealand to Reduce Covid Self-Isolation Period to Seven Days\n']
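
The output above is a list with one item because `readlines()` always returns a list; its argument is a size hint, not a line count. If a plain string is wanted, `readline()` is the closer fit. A short comparison, assuming a made-up two-line file (`headline_demo.txt` and its contents are invented for this sketch):

```python
import os
import tempfile

# Hypothetical two-line file for comparing the two calls
path = os.path.join(tempfile.mkdtemp(), "headline_demo.txt")
with open(path, "w") as f:
    f.write("A headline\nA second line\n")

with open(path, "r") as my_text:
    as_list = my_text.readlines(1)  # list holding the first line

with open(path, "r") as my_text:
    as_string = my_text.readline()  # the first line as a string

print(as_list)
print(repr(as_string))
```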

In [39]:
## read the whole thing into a variable
with open(file_name, "r") as my_text:  # "r" reads; "w" would write and "a" would append
    all_text = my_text.read()

In [40]:
## call the variable above
all_text

'New Zealand to Reduce Covid Self-Isolation Period to Seven Days\nByTracy Withers\n\nNew Zealand will reduce the isolation period for Covid-19 cases and their household contacts to seven days in order to get more people back to work.\n\nThe period will reduce from 10 days effective at 11:59 p.m. on Friday March 11 in Wellington, Minister for Covid Response Chris Hipkins said in a statement.\n'

## It's often more useful to save the text as a list of lines. 
Remember, ```readlines()``` returns each line as an item in a list.

In [43]:
## store entire text file in list
with open(file_name, "r") as my_text:  # "r" reads; "w" would write and "a" would append
    all_text_list = my_text.readlines()

In [44]:
## call list
all_text_list

['New Zealand to Reduce Covid Self-Isolation Period to Seven Days\n',
 'ByTracy Withers\n',
 '\n',
 'New Zealand will reduce the isolation period for Covid-19 cases and their household contacts to seven days in order to get more people back to work.\n',
 '\n',
 'The period will reduce from 10 days effective at 11:59 p.m. on Friday March 11 in Wellington, Minister for Covid Response Chris Hipkins said in a statement.\n']
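
Combining the two halves of this notebook, one possible pattern is to loop over the paths that `glob` returns and store each file's lines under its file name. This is only a sketch: the temporary folder, the file names `one.txt` and `two.txt`, and their contents are invented so the example runs on its own; with the sample files you would glob `"demo-docs/*.txt"` instead.

```python
import glob
import os
import tempfile

# Hypothetical folder with two small text files
docs_dir = tempfile.mkdtemp()
samples = {"one.txt": "alpha\nbeta\n", "two.txt": "gamma\n"}
for name, content in samples.items():
    with open(os.path.join(docs_dir, name), "w") as f:
        f.write(content)

# Map each file name to the list of lines it contains
lines_by_file = {}
for path in sorted(glob.glob(os.path.join(docs_dir, "*.txt"))):
    with open(path, "r") as my_text:
        lines_by_file[os.path.basename(path)] = my_text.readlines()

print(lines_by_file)
```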


## We can then pull items out of our list by index

In [46]:
## show the third list item (index 2)
all_text_list[2]

'\n'
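
The third item is just a blank line, and every other item still carries its trailing `\n`. If those get in the way, a list comprehension with `.strip()` can drop both. The list below is a made-up stand-in shaped like `all_text_list`, not the actual article text.

```python
# Hypothetical lines in the same shape that readlines() returns
raw_lines = [
    "Headline\n",
    "By Someone\n",
    "\n",
    "First paragraph.\n",
    "\n",
    "Second paragraph.\n",
]

# strip() removes surrounding whitespace, including the newline;
# a blank line strips down to "" and is skipped by the if clause
clean_lines = [line.strip() for line in raw_lines if line.strip()]
print(clean_lines)
```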

In [47]:
## create a variable to hold the byline
byline = all_text_list[1]

In [48]:
byline

'ByTracy Withers\n'
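
The byline still includes the leading "By" and a trailing newline. One way to reduce it to the author's name, sketched here under the assumption that bylines in these files always start with "By", is to strip the whitespace and then cut off that prefix. The `author` variable is introduced only for this example.

```python
# Same value the cell above stored in byline
byline = "ByTracy Withers\n"

# Remove the trailing newline and any surrounding spaces
author = byline.strip()

# Assumption: the byline starts with "By"; cut it off only if present
if author.startswith("By"):
    author = author[len("By"):].strip()

print(author)
```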