In [None]:
import pandas as pd 
import json
DATA_DIR = "./data/"

# Handling different file formats

On top of varying data types, data scientists also have to deal with different file formats. This notebook explores common file formats you may encounter and shows how to read them for data analysis and modeling.

# CSV Files

CSV stands for comma-separated values: a comma separates the values in each row. CSV files usually have a `.csv` extension.

To read a CSV file, use the `read_csv` function from `pandas` and pass it the file path or file name of the CSV file.

```python
df = pd.read_csv(filename)
df.head()
```
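
As a self-contained sketch, the snippet below passes `read_csv` an in-memory string through `io.StringIO` instead of a file on disk, so it runs without any data file. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV content for illustration; a real file path works the same way.
csv_text = "name,age\nAna,31\nBen,27\n"

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)
```

The first row is used as the header by default, so the resulting DataFrame has two data rows and two columns.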

# Excel Files

An Excel file also stores tabular data, but it is created with Microsoft Excel and can contain several sheets. The file extensions used are `.xls` and `.xlsx`.

```python
# `name` is the path to the Excel file; `sheet_name` selects the sheet to read.
df = pd.read_excel(name, sheet_name='Test')
df.head()
```

# Text Files

A text file contains plain text without any formatting metadata (e.g. font, font size, or bold). The usual file extension is `.txt`.

You can read a text file line by line with `readlines()`, which returns a list of strings.

In [2]:
text_file = open(DATA_DIR + "test.txt", "r")

lines = text_file.readlines()
for line in lines:
    print(line)
text_file.close()

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla bibendum nunc, at aliquet odio ultricies id. In hac habitasse platea dictumst. Fusce id mi lacinia eros pellentesque molestie nec ut quam. Aliquam erat volutpat. Aliquam erat volutpat. Aenean vitae mi quis mauris vestibulum euismod et egestas sapien. Morbi aliquam euismod metus, at sodales felis congue quis. Vestibulum bibendum lobortis arcu, vel mattis dui rutrum non. Etiam tempor mi non nibh consequat, ut venenatis sem dictum. Nullam sagittis libero vel felis iaculis varius sit amet fermentum diam. Donec at tortor non felis ultrices sollicitudin.



Nulla interdum dignissim urna, id auctor sapien. Integer in rhoncus arcu. Vestibulum at gravida velit. Ut pharetra ultricies elit sed consequat. Mauris laoreet eleifend auctor. Nulla scelerisque commodo dictum. Donec varius mauris eu mollis convallis. Phasellus id dictum mi. Etiam luctus mattis arcu, commodo maximus purus sodales vel. Nam justo ex, volutpat in eros eg
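
The extra blank lines in the output above come partly from `print`, which adds its own newline after each item, while every string returned by `readlines()` already ends with `\n`. A minimal sketch of this, using a temporary file with made-up content rather than `test.txt`:

```python
import os
import tempfile

# Write a small made-up file so the example does not depend on ./data/.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
out_file = open(path, "w")
out_file.write("first line\nsecond line\n")
out_file.close()

in_file = open(path, "r")
raw_lines = in_file.readlines()
in_file.close()

# Each element keeps its trailing newline character.
print(raw_lines)

# rstrip("\n") removes it, so print() no longer doubles the line breaks.
clean_lines = [line.rstrip("\n") for line in raw_lines]
print(clean_lines)
```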

You can also read the whole text file at once with `read()`, which returns a single string.


In [3]:
text_file = open(DATA_DIR + "test.txt", "r")
lines = text_file.read()
print(lines)
text_file.close()

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc fringilla bibendum nunc, at aliquet odio ultricies id. In hac habitasse platea dictumst. Fusce id mi lacinia eros pellentesque molestie nec ut quam. Aliquam erat volutpat. Aliquam erat volutpat. Aenean vitae mi quis mauris vestibulum euismod et egestas sapien. Morbi aliquam euismod metus, at sodales felis congue quis. Vestibulum bibendum lobortis arcu, vel mattis dui rutrum non. Etiam tempor mi non nibh consequat, ut venenatis sem dictum. Nullam sagittis libero vel felis iaculis varius sit amet fermentum diam. Donec at tortor non felis ultrices sollicitudin.

Nulla interdum dignissim urna, id auctor sapien. Integer in rhoncus arcu. Vestibulum at gravida velit. Ut pharetra ultricies elit sed consequat. Mauris laoreet eleifend auctor. Nulla scelerisque commodo dictum. Donec varius mauris eu mollis convallis. Phasellus id dictum mi. Etiam luctus mattis arcu, commodo maximus purus sodales vel. Nam justo ex, volutpat in eros eget

# JSON Files

JSON stands for JavaScript Object Notation. It is a file format commonly used to exchange information on the web. The file extension commonly used is `.json`.

The script below uses the `with` statement. It helps with file handling by automatically closing the file once the `with` code block ends, which makes the code much cleaner. It is best practice to close a file after using it, because an open file keeps holding system resources and may be inaccessible to other programs. *That is why closing the file is important.*

In [4]:
import json
with open(DATA_DIR + "twitter.json", 'r') as json_file:
    json_data = json.load(json_file)
    display(json_data)

{'text': 'RT @PostGradProblem: In preparation for the NFL lockout, I will be spending twice as much time analyzing my fantasy baseball team during ...',
 'truncated': True,
 'in_reply_to_user_id': None,
 'in_reply_to_status_id': None,
 'favorited': False,
 'source': '<a href="http://twitter.com/" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id_str': None,
 'id_str': '54691802283900928',
 'entities': {'user_mentions': [{'indices': [3, 19],
    'screen_name': 'PostGradProblem',
    'id_str': '271572434',
    'name': 'PostGradProblems',
    'id': 271572434}],
  'urls': [],
  'hashtags': []},
 'contributors': None,
 'retweeted': False,
 'in_reply_to_user_id_str': None,
 'place': None,
 'retweet_count': 4,
 'created_at': 'Sun Apr 03 23:48:36 +0000 2011',
 'retweeted_status': {'text': 'In preparation for the NFL lockout, I will be spending twice as much time analyzing my fantasy baseball team during company time. #PGP',
  'truncated': False,
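
One way to check that `with` really closes the file is to inspect the file object's `closed` attribute inside and after the block. A minimal sketch, using a temporary made-up JSON file rather than `twitter.json`:

```python
import json
import os
import tempfile

# Create a small made-up JSON file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.json")

with open(path, "w") as out_file:
    json.dump({"retweet_count": 4}, out_file)

with open(path, "r") as json_file:
    sample = json.load(json_file)
    # Inside the block the file is still open.
    open_inside = not json_file.closed

# After the block the file has been closed automatically.
closed_after = json_file.closed
print(open_inside, closed_after)
```

The file is closed even if an exception is raised inside the block, which is the main advantage over calling `close()` by hand.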


If you are curious what the script would look like without the `with` statement, here it is.

In [5]:
json_file = open(DATA_DIR + "twitter.json", 'r')
json_data = json.load(json_file)
display(json_data)
json_file.close()

{'text': 'RT @PostGradProblem: In preparation for the NFL lockout, I will be spending twice as much time analyzing my fantasy baseball team during ...',
 'truncated': True,
 'in_reply_to_user_id': None,
 'in_reply_to_status_id': None,
 'favorited': False,
 'source': '<a href="http://twitter.com/" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id_str': None,
 'id_str': '54691802283900928',
 'entities': {'user_mentions': [{'indices': [3, 19],
    'screen_name': 'PostGradProblem',
    'id_str': '271572434',
    'name': 'PostGradProblems',
    'id': 271572434}],
  'urls': [],
  'hashtags': []},
 'contributors': None,
 'retweeted': False,
 'in_reply_to_user_id_str': None,
 'place': None,
 'retweet_count': 4,
 'created_at': 'Sun Apr 03 23:48:36 +0000 2011',
 'retweeted_status': {'text': 'In preparation for the NFL lockout, I will be spending twice as much time analyzing my fantasy baseball team during company time. #PGP',
  'truncated': False,


Once the file has been loaded, Python treats the data like any other dictionary, so the syntax for extracting information from it is the same as for any Python dictionary.

In [6]:
print("Twitter Handle:" + json_data['user']['name'])
print("Tweet:" + json_data['text'])
print("Date Created:" + json_data['created_at'])
print("Retweet Count:" + str(json_data['retweet_count']))

Twitter Handle:GG
Tweet:RT @PostGradProblem: In preparation for the NFL lockout, I will be spending twice as much time analyzing my fantasy baseball team during ...
Date Created:Sun Apr 03 23:48:36 +0000 2011
Retweet Count:4
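
Real JSON records often have missing or null fields (the tweet above has `'place': None`). A hedged sketch of more defensive access with `dict.get`, using a small made-up record rather than the full tweet:

```python
import json

# A made-up, trimmed-down record for illustration only.
raw = '{"text": "hello", "retweet_count": 4, "place": null, "entities": {"hashtags": []}}'
record = json.loads(raw)

# JSON null becomes Python None.
print(record["place"] is None)

# get() returns a default instead of raising KeyError for a missing key.
language = record.get("lang", "unknown")

# Chain get() with an empty-dict default to walk nested keys safely.
urls = record.get("entities", {}).get("urls", [])
print(language, urls)
```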


# Zip files

Zip is an archive file format with lossless data compression. This means the files are compressed to a smaller size but can be extracted back to their original contents without any loss of information. The common file extension is `.zip`.

In [7]:
from zipfile import ZipFile
with ZipFile(DATA_DIR + "titanic.zip", 'r') as zip_file:
    zip_file.printdir()

    # Extract a specific file to a specified directory
    zip_file.extract('gender_submission.csv', DATA_DIR)

    # Extract all files to a specified directory
    zip_file.extractall(DATA_DIR)

File Name                                             Modified             Size
gender_submission.csv                          2019-12-11 02:17:12         3258
test.csv                                       2019-12-11 02:17:12        28629
train.csv                                      2019-12-11 02:17:12        61194


It is also possible to read files without extracting them.

In [8]:
with ZipFile(DATA_DIR + "titanic.zip", 'r') as zip_file:
    df = pd.read_csv(zip_file.open('test.csv'))
    display(df.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
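
`ZipFile.open` returns a binary file-like object, which is why it can be handed straight to `pd.read_csv`. The sketch below shows the same idea using only the standard library: it builds a small archive in memory (the member name and rows are made up), then reads one member as text without extracting it.

```python
import csv
import io
from zipfile import ZIP_DEFLATED, ZipFile

# Build a tiny archive in memory; the member name and rows are made up.
buffer = io.BytesIO()
with ZipFile(buffer, "w", compression=ZIP_DEFLATED) as archive:
    archive.writestr("people.csv", "name,age\nAna,31\nBen,27\n")

buffer.seek(0)
with ZipFile(buffer, "r") as archive:
    member_names = archive.namelist()
    # open() yields bytes, so wrap it to decode the member as text.
    with archive.open("people.csv") as member:
        text_stream = io.TextIOWrapper(member, encoding="utf-8", newline="")
        rows = list(csv.DictReader(text_stream))

print(member_names)
print(rows)
```

Note that `csv.DictReader` leaves every value as a string, whereas `pd.read_csv` infers numeric types.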


In [9]:
! ls "./data/"

gender_submission.csv test.txt              train.csv
test.csv              titanic.zip           twitter.json
