### <div id="py"> Working with different file formats </div>



- JSON (JavaScript Object Notation)
- CSV (Comma-Separated Values)
- Excel
- Avro


### Data comes in various forms

<img src="images/data_gen.jpeg">

As a data professional, you will deal with many types of data, so it is important to learn how to handle these file formats.



## Working with JSON files

***
Since its inception, JSON has quickly become the de facto standard for information exchange. 

Chances are you’re here because you need to transport some data from here to there. Perhaps you’re gathering information through an API or storing your data in a document database. 

One way or another, you’re up to your neck in JSON, and you’ve got to Python your way out.








## A (Very) Brief History of JSON

JSON stands for JavaScript Object Notation. It was inspired by a subset of the JavaScript programming language dealing with object literal syntax.

Ultimately, the community at large adopted JSON because it’s easy for both humans and machines to create and understand.

### Look, it’s JSON!
```
{
    "firstName": "Jane",
    "lastName": "Doe",
    "hobbies": ["running", "sky diving", "singing"],
    "age": 35,
    "children": [
        {
            "firstName": "Alice",
            "age": 6
        },
        {
            "firstName": "Bob",
            "age": 8
        }
    ]
}
```

### Does this look similar to something?

YES! Python **dictionary!**
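
To make the comparison concrete, here is a minimal sketch that parses the JSON text above with the standard-library `json` module. The variable names (`json_text`, `person`) are illustrative only.

```python
import json

# The JSON document from above, held as a Python string.
json_text = """
{
    "firstName": "Jane",
    "lastName": "Doe",
    "hobbies": ["running", "sky diving", "singing"],
    "age": 35,
    "children": [
        {"firstName": "Alice", "age": 6},
        {"firstName": "Bob", "age": 8}
    ]
}
"""

# json.loads() parses a JSON string into Python objects.
person = json.loads(json_text)

print(type(person))             # a JSON object becomes a dict
print(type(person["hobbies"]))  # a JSON array becomes a list
print(person["children"][0]["firstName"])
```

Nested objects and arrays become nested dictionaries and lists, so ordinary indexing works all the way down.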

### Writing JSON files

In [21]:
import json

In [22]:
data = {
    "president": {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}

In [23]:
with open("data_file.json", "w") as write_file:
    json.dump(data, write_file)

Note that `dump()` takes two positional arguments:
1. the data object to be serialized, and
2. the file-like object to which the serialized text will be written.
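
As a small illustration of those two arguments, the sketch below writes to an in-memory `io.StringIO` object instead of a real file (an assumption made here so the example leaves nothing on disk) and uses the optional `indent` argument for readability.

```python
import io
import json

data = {
    "president": {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}

# Any object with a .write() method that accepts str can be the target.
buffer = io.StringIO()
json.dump(data, buffer, indent=4)  # indent is optional; it pretty-prints

text = buffer.getvalue()
print(text)

# json.dumps() (note the "s") returns the string directly instead.
compact = json.dumps(data)
print(compact)
```

Use `dump()` when you already have an open file, and `dumps()` when you just need the JSON text as a string.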

### Reading JSON files

In [24]:
with open("data_file.json", "r") as read_file:
    data = json.load(read_file)

In [None]:
type(data)

In [None]:
data
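
A round trip through JSON does not always give back exactly the object you started with. The following sketch (the `original` dictionary is made up for illustration) shows two common surprises: tuples come back as lists, and non-string keys come back as strings.

```python
import json

original = {
    "point": (1, 2),  # a tuple
    1: "one",         # an integer key
}

# Serialize to a JSON string, then parse it straight back.
restored = json.loads(json.dumps(original))

print(restored)
print(restored == original)
```

JSON has only arrays and string keys, so this information is lost on the way out. Convert the values back yourself after loading if your code depends on them.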

### You can also read JSON as a DataFrame in Pandas

In [None]:
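from io import StringIO

import pandas as pd

jsonStr = '''{"Index0":{"Courses": "Pandas","Discount": "1200"},
           "Index1":{"Courses": "Hadoop","Discount": "1500"},
           "Index2":{"Courses": "Spark","Discount": "1800"}
          }'''

# Convert JSON to DataFrame using read_json().
# Recent pandas versions expect a path or file-like object rather than a raw
# JSON string, so wrap the string in StringIO.
df2 = pd.read_json(StringIO(jsonStr), orient='index')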
print(df2)

### Convert a dict to a DataFrame

In [None]:
data['president']

In [33]:
import pandas as pd

df3 = pd.DataFrame.from_dict(data, orient ='index')

In [None]:
df3

## Working with CSV files

A CSV file (Comma Separated Values file) is a type of plain text file that uses specific structuring to arrange tabular data. 

It’s a plain text file that has data separated by commas!

```
column 1 name,column 2 name, column 3 name
first row data 1,first row data 2,first row data 3
second row data 1,second row data 2,second row data 3
...
```
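
To see that structure in action without any third-party library, here is a minimal sketch using Python's built-in `csv` module. The sample rows are invented for illustration and are not taken from `data/hrdata.csv`.

```python
import csv
import io

# Hypothetical CSV text; the first line is the header row.
csv_text = """Name,Department,Salary
Alice,Engineering,50000
Bob,Marketing,42000
"""

# DictReader uses the header row as the keys for every following row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    print(row["Name"], row["Department"], row["Salary"])

# Every value is read as a string; convert explicitly when you need numbers.
total_salary = sum(int(row["Salary"]) for row in rows)
print(total_salary)
```

The `csv` module leaves type conversion to you, which is one reason pandas is often more convenient for tabular data.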

In [35]:
df = pd.read_csv('data/hrdata.csv', index_col='Name')

In [None]:
df.head()

In [37]:
df = pd.read_csv('data/hrdata.csv', index_col='Name', parse_dates=['Hire Date'])

In [None]:
df

In [None]:
df.to_csv('data/hrdata_modified.csv')

## Working with Excel Files

Excel spreadsheets are one of those things you might have to deal with at some point. Either it’s because your boss loves them or because marketing needs them, and you might have to learn how to work with spreadsheets.

Many companies still prefer Excel files for data storage and analysis, so as a data expert you should know how to handle these files programmatically!

To work with Excel files in Python, we can use the `openpyxl` package.

In [None]:
pip install openpyxl

### Basics of Excel

<img src="images/excel.png" width=550>

In [40]:
from openpyxl import Workbook

workbook = Workbook()
sheet = workbook.active

sheet["A1"] = "hello"
sheet["B1"] = "world!"

workbook.save(filename="hello_world.xlsx")

In [None]:
# Reading an Excel file

from openpyxl import load_workbook
workbook = load_workbook(filename="data/sample-xlsx-file.xlsx")
workbook.sheetnames
# ['Sheet 1']


In [42]:
sheet = workbook.active

In [None]:
sheet

In [None]:
sheet.title

In [None]:
sheet["A1"]

In [None]:
sheet["A2"].value

In [None]:
sheet.cell(row=10, column=6)

In [None]:
sheet.cell(row=3, column=3).value
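
`sheet["A2"]` and `sheet.cell(row=2, column=1)` refer to the same cell: the letters name the column and the digits name the row, both counted from 1. The helper below is a hypothetical plain-Python sketch of that mapping (it is not part of `openpyxl`), just to show how the two addressing styles line up.

```python
import re

def a1_to_row_col(ref):
    """Convert a reference such as 'C3' to a 1-based (row, column) pair."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a valid cell reference: {ref!r}")
    letters, digits = match.groups()

    # Column letters behave like a base-26 number where A=1 ... Z=26.
    column = 0
    for letter in letters.upper():
        column = column * 26 + (ord(letter) - ord("A") + 1)

    return int(digits), column

print(a1_to_row_col("A2"))
print(a1_to_row_col("C3"))
print(a1_to_row_col("AA10"))
```

In real code you would not need this helper, since `openpyxl.utils` provides `column_index_from_string` and `get_column_letter` for the same conversions.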

In [None]:
sheet["A1:C2"]

In [None]:
for row in sheet.iter_rows(values_only=True):
    print(row)

### You can read an Excel file as a DataFrame using Pandas

In [52]:
excel_df = pd.read_excel('data/sample-xlsx-file.xlsx')

In [None]:
excel_df

In [66]:
excel_df.to_excel('data/sample-xlsx-file-modified.xlsx')

## Working with Avro

Apache Avro is a data serialization format. We can store data as `.avro` files on disk. 

Avro files are often used with Spark, but Spark is completely independent of Avro.

Avro is a row-based format that is suitable for evolving data schemas. One benefit of using Avro is that the schema and metadata travel with the data.

If you have an .avro file, you have the schema of the data as well. 

The Apache Avro Specification provides easy-to-read yet detailed information.

In [None]:
pip install avro-python3

In [16]:
# Python 3 with `avro-python3` package available
import copy
import json
import avro.schema  # imported explicitly because avro.schema.Parse is used below
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

In [55]:

# Note that we combined namespace and name to get "full name"
schema = {
    'name': 'avro.example.User',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'}
    ]
}

# Parse the schema so we can use it to write the data
schema_parsed = avro.schema.Parse(json.dumps(schema))

In [None]:
schema_parsed
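
In Avro, a record's full name is its namespace and name joined by a dot. The sketch below uses only plain dictionaries (no Avro library) to show an equivalent way of writing the schema above with a separate `namespace` key. The `full_name` helper is hypothetical and exists only for this illustration.

```python
import json

# Same record as above, but with namespace and name kept apart.
schema_with_namespace = {
    "namespace": "avro.example",
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

def full_name(schema):
    """Join namespace and name the way Avro defines a full name."""
    namespace = schema.get("namespace")
    name = schema["name"]
    # A name that already contains a dot is treated as a full name.
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"

print(full_name(schema_with_namespace))

# An Avro schema is ordinary JSON, so it serializes with the json module.
print(json.dumps(schema_with_namespace, indent=2))
```

Both spellings describe the same record type, so which one to use is mostly a matter of readability.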

In [57]:

# Write data to an avro file
with open('users.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), schema_parsed)
    writer.append({'name': 'Pierre-Simon Laplace', 'age': 77})
    writer.append({'name': 'John von Neumann', 'age': 53})
    writer.close()

In [None]:

# Read data from an avro file
with open('users.avro', 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    metadata = copy.deepcopy(reader.meta)
    schema_from_file = json.loads(metadata['avro.schema'])
    users = [user for user in reader]
    reader.close()

print(f'Schema that we specified:\n {schema}')
print(f'Schema that we parsed:\n {schema_parsed}')
print(f'Schema from users.avro file:\n {schema_from_file}')
print(f'Users:\n {users}')

### Reading Avro Using Pandas

The Avro format simply requires a schema and a list of records. We don’t need a dataframe to handle Avro files.

However, we can write a `pandas` dataframe into an Avro file or read an Avro file into a `pandas` dataframe.

To begin with, we can always represent a dataframe as a list of records and vice versa.
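
The idea can be sketched without pandas at all: a table is either a list of row dictionaries (records) or a dictionary of column lists, and each form can be rebuilt from the other. The helper names below are hypothetical, and the sketch assumes every record has the same keys. In pandas the corresponding calls are `pd.DataFrame.from_records(...)` and `df.to_dict(orient="records")`.

```python
users = [
    {"name": "Pierre-Simon Laplace", "age": 77},
    {"name": "John von Neumann", "age": 53},
]

def records_to_columns(records):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

def columns_to_records(columns):
    """Turn a dict of equal-length column lists back into row dicts."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]

columns = records_to_columns(users)
print(columns)
print(columns_to_records(columns) == users)
```

Avro stores data in the record-oriented form, which is why a dataframe has to be converted to records before it can be written.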

In [None]:
pip install pandavro

In [77]:
import copy
import json
import pandas as pd
import pandavro as pdx
from avro.datafile import DataFileReader
from avro.io import DatumReader

In [None]:
# Data to be saved
users = [{'name': 'Pierre-Simon Laplace', 'age': 77},
         {'name': 'John von Neumann', 'age': 53}]
users_df = pd.DataFrame.from_records(users)
print(users_df)

In [79]:
pdx.to_avro('data/users_test.avro', users_df)

In [None]:
# Read the data back
users_df_redux = pdx.from_avro('data/users_test.avro')
print(type(users_df_redux))
# <class 'pandas.core.frame.DataFrame'>


In [None]:
# Check the schema for "users.avro"
with open('users.avro', 'rb') as f:
    reader = DataFileReader(f, DatumReader())
    metadata = copy.deepcopy(reader.meta)
    schema_from_file = json.loads(metadata['avro.schema'])
    reader.close()
print(schema_from_file)