***
Welcome! 

In this Notebook we will explore how to read external files into Python - particularly CSV and TXT files.
***

### Index:

[1.1 - Reading CSV Files](#1.1---Reading-CSV-Files)
<br>
[1.2 - Reading TXT Files](#1.2---Reading-TXT-Files)


In [1]:
# Let's load the Pandas library
import pandas as pd

# Importing the csv library
import csv

### 1.1 - Reading CSV Files

Let's first check what is in the folder where this notebook is located:

In [2]:
!dir

 Volume in drive C is Windows-SSD
 Volume Serial Number is 745F-D042

 Directory of C:\Users\ivopb\Google Drive\Courses Instruction\Natural Language Processing in Python\05 - Importing Text Data to Python

19/04/2021  18:18    <DIR>          .
19/04/2021  18:18    <DIR>          ..
19/04/2021  18:19    <DIR>          .ipynb_checkpoints
20/03/2021  20:05            39 511 01 - Reading Data from CSV and TXT Files.ipynb
23/03/2021  23:38         1 389 319 02 - Scraping Data from the Web.ipynb
27/03/2021  01:24            14 126 03 - Using API's.ipynb
19/04/2021  18:19    <DIR>          data
03/04/2021  22:00            10 798 Exercises - Importing Text Data - Solutions.ipynb
03/04/2021  22:02            10 365 Exercises - Importing Text Data.ipynb
19/04/2021  18:20    <DIR>          exercise_data
               5 File(s)      1 464 119 bytes
               5 Dir(s)  65 761 685 504 bytes free


I have a folder named data next to this notebook - this is where our files are located:

In [3]:
!dir data

 Volume in drive C is Windows-SSD
 Volume Serial Number is 745F-D042

 Directory of C:\Users\ivopb\Google Drive\Courses Instruction\Natural Language Processing in Python\05 - Importing Text Data to Python\data

19/04/2021  18:19    <DIR>          .
19/04/2021  18:19    <DIR>          ..
16/02/2004  02:49             4 316 cv042_10982.txt
16/12/2019  21:36           987 712 tweets_data.csv
               2 File(s)        992 028 bytes
               2 Dir(s)  65 751 945 216 bytes free


Inside this folder, we have a *tweets_data.csv* file. Let's load it in two ways: with pandas and with base Python.

Using pandas is really simple:

In [4]:
# We can just provide a path to the read_csv function - the
# returned object is a pandas DataFrame
pd.read_csv('./data/tweets_data.csv', sep=',')

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1
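
The `Unnamed: 0` column in the output above is the placeholder name pandas gives to a column whose header field is empty. A minimal sketch, assuming that column is only a saved row index: passing `index_col=0` to `read_csv` uses it as the DataFrame index instead. The in-memory sample below is hypothetical and only mimics the layout of tweets_data.csv.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the layout of tweets_data.csv
sample_csv = (
    ",id,keyword,location,text,target\n"
    "0,1,,,Forest fire near La Ronge Sask. Canada,1\n"
    "1,4,,,Just got sent this photo from Ruby #Alaska,1\n"
)

# Default behaviour: the empty first header field becomes "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(sample_csv))

# Assumption: the first column is only a row index, so use it as the index
df_indexed = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Whether to keep or drop that column depends on how the file was produced, so it is worth checking before applying this to real data.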


In [5]:
with open('./data/tweets_data.csv', mode='r') as csv_file:
    csv_f = csv.DictReader(csv_file)  
    file_data = []
    for row in csv_f:
        file_data.append(dict(row))

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2245: character maps to <undefined>

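Encodings can sometimes give us trouble. Here, open() used the platform's default encoding (the 'charmap' codec named in the error above), which cannot decode every byte in the file. Passing the encoding explicitly solves it: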

In [6]:
with open('./data/tweets_data.csv', mode='r', encoding="utf-8") as csv_file:
    csv_f = csv.DictReader(csv_file)  
    file_data = []
    for row in csv_f:
        file_data.append(dict(row))
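
The error above can be reproduced without any file. The sketch below is only an illustration: the sample row is hypothetical, and it assumes the failing default codec was cp1252 (a common Windows default), in which byte 0x8f has no mapping.

```python
import csv
import io

# Hypothetical CSV content; "\u00cf" is stored in UTF-8 as bytes 0xC3 0x8F
raw_bytes = "id,text\n1,NA\u00cfVE\n".encode("utf-8")

# Assumption: the platform default was cp1252, where 0x8f is undefined
try:
    raw_bytes.decode("cp1252")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the encoding the data was written in works
decoded = raw_bytes.decode("utf-8")
rows = [dict(row) for row in csv.DictReader(io.StringIO(decoded))]

# errors="replace" avoids the exception but substitutes U+FFFD, losing data
lossy = raw_bytes.decode("cp1252", errors="replace")

print(decode_failed)
print(rows)
```

Using `errors="replace"` silences the exception at the cost of corrupting characters, so naming the correct encoding is usually the better fix.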


In [7]:
file_data[2]

{'id': '5',
 'keyword': '',
 'location': '',
 'text': "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
 'target': '1'}

If we want to read the text from this dictionary:

In [8]:
file_data[2]['text']

"All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected"

### 1.2 - Reading TXT Files

To read txt files into Python we can use the same logic:
- Read using the pandas library;
- Read using base Python.

In [9]:
# Reading a plain text file with pandas (one that is not really
# delimiter-separated) may give odd results - here the first line
# of the review ends up as the column header:
pd.read_csv('./data/cv042_10982.txt', sep = '\t')

Unnamed: 0,will hunting ( matt damon ) is a natural genius .
0,"for a movie character , that's usually a death..."
1,it's a trait associated with what my brother c...
2,"forgive me for spoiling the ending , but will ..."
3,this is no formula movie .
4,"in fact , it's quite fresh and original ."
5,"it's a character study more than anything , an..."
6,will works whatever kind of job he can get .
7,"first he's a janitor , then he works construct..."
8,off-screen he speed reads books on any academi...
9,"on-screen he hangs out with his friends , pick..."
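
In the output above, the first sentence of the review was consumed as the column name. A minimal sketch of one way around this, using a hypothetical in-memory sample: `header=None` tells `read_csv` that the file has no header row, and `names` supplies a column name (`sentence` is an illustrative choice, not a name from the file).

```python
import io

import pandas as pd

# Hypothetical sample: two lines in the style of the review file
sample_txt = (
    "will hunting ( matt damon ) is a natural genius .\n"
    "this is no formula movie .\n"
)

# header=None keeps the first line as data; names labels the single column
df = pd.read_csv(
    io.StringIO(sample_txt),
    sep="\t",
    header=None,
    names=["sentence"],
)

print(df.shape)
```

Lines that contain double quotes or tab characters may still need extra care, which is one reason base Python is often simpler for free text.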


In [10]:
f = open("./data/cv042_10982.txt", "r")
print(f.read())
# read() leaves the file pointer at the end of the file,
# so a second read() returns an empty string
text = f.read()
f.close()

will hunting ( matt damon ) is a natural genius . 
for a movie character , that's usually a death sentence . 
it's a trait associated with what my brother calls " too good for this world " movies , like phenomenon or powder . 
forgive me for spoiling the ending , but will doesn't die . 
this is no formula movie . 
in fact , it's quite fresh and original . 
it's a character study more than anything , and that's not surprising , considering it was written by two actors : damon and co-star ben affleck . 
will works whatever kind of job he can get . 
first he's a janitor , then he works construction . 
off-screen he speed reads books on any academic subject that interests him . 
on-screen he hangs out with his friends , picking fights in robust , romanticized-hemingway fashion . 
lambeau ( stellan skarsgard from breaking the waves ) , a math professor , learns that the janitor ( will ) is a genius with a special talent for advanced mathematics . 
having confirmed he's not a fluke or a sava
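
The cell above calls `read()` twice on the same file object. As a small sketch of what happens to the second call, the example below writes a hypothetical two-line file to a temporary folder, reads it twice, and then rewinds with `seek(0)`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file used only for this illustration
    path = os.path.join(tmp_dir, "review.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("line one\nline two\n")

    with open(path, "r", encoding="utf-8") as f:
        first = f.read()   # reads everything, pointer is now at the end
        second = f.read()  # nothing left to read
        f.seek(0)          # move the pointer back to the start
        third = f.read()   # the full content again

print(repr(second))
print(first == third)
```

Storing the result of the first `read()` in a variable, and printing that variable, avoids the need to rewind at all.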

In [11]:
# The with statement closes the file automatically
with open("./data/cv042_10982.txt", "r") as f:
    movie_review = f.read()

In [12]:
print(movie_review)

will hunting ( matt damon ) is a natural genius . 
for a movie character , that's usually a death sentence . 
it's a trait associated with what my brother calls " too good for this world " movies , like phenomenon or powder . 
forgive me for spoiling the ending , but will doesn't die . 
this is no formula movie . 
in fact , it's quite fresh and original . 
it's a character study more than anything , and that's not surprising , considering it was written by two actors : damon and co-star ben affleck . 
will works whatever kind of job he can get . 
first he's a janitor , then he works construction . 
off-screen he speed reads books on any academic subject that interests him . 
on-screen he hangs out with his friends , picking fights in robust , romanticized-hemingway fashion . 
lambeau ( stellan skarsgard from breaking the waves ) , a math professor , learns that the janitor ( will ) is a genius with a special talent for advanced mathematics . 
having confirmed he's not a fluke or a sava

We can now work with this text like any other string. Let's confirm it by checking the type of the object:

In [13]:
type(movie_review)

str
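
Since the review is a plain `str`, the usual string methods apply. A minimal sketch, using the first two lines of the review as a stand-in for `movie_review`: `splitlines()` separates the sentences and `split()` breaks a sentence on whitespace. The variable names are illustrative.

```python
# Stand-in for movie_review: the first two lines of the review
sample_review = (
    "will hunting ( matt damon ) is a natural genius . \n"
    "for a movie character , that's usually a death sentence . \n"
)

# One entry per non-empty line, without surrounding whitespace
lines = [line.strip() for line in sample_review.splitlines() if line.strip()]

# Whitespace tokenisation of the first line
tokens = lines[0].split()

print(len(lines))
print(tokens[:2])
```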