<hr>
<div style="background-color: lightgray; padding: 20px; color: black;">
<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/Coursera-Logo_600x600.svg/1024px-Coursera-Logo_600x600.svg.png" style="float: right; margin-right: 30px;" width="120"/> 
<font size="6.5" color="#0056D2"><b>Composing File and Data Solutions</b></font> <br>
<font size="5.5" color="#0056D2"><b>Working with Data in Python </b></font> 
</div>
<div style="text-align: left">  <br>
Edison David Serrano Cárdenas. <br>
MSc in Applied Mathematics <br>
CIMAT - Sede Guanajuato <br>
</div>

</div>
<hr>

##  <font color="#0056D2" >**Objectives**</font> 
In this module, you will learn how to effectively use Python’s data structures to load, persist, and iterate over data. You will apply these data structures to solve different problems when working with popular data formats like JSON.

Load Packages:

In [22]:
import os
import pandas as pd
import json

# <font color="#0056D2" >**Exploring Data Structures in Python**</font> 

<font color="#0056D2" >**Using Lists to Save and Retrieve Data in Python**</font> 

In [12]:
list_names = ["Pablo","Oscar"]
list_names.insert(0,"David")
print("List after insert:\t ",list_names)
print("Pablo index in list:\t ",list_names.index('Pablo'))

directories = os.listdir('..')
print("Files in main folder:\t ",directories)

List after insert:	  ['David', 'Pablo', 'Oscar']
Pablo index in list:	  1
Files in main folder:	  ['week1', '.git', 'README.md', 'LICENSE']


Calling `index()` with a name that is not in the list raises a `ValueError`.

In [13]:
list_names.index("Alex")

ValueError: 'Alex' is not in list
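
One way to avoid the exception is to catch it and return a fallback value. The sketch below assumes a hypothetical helper named `find_index` (it is not part of the course material) and a default of `-1` chosen only for illustration.

```python
def find_index(items, value, default=-1):
    """Return the position of value in items, or default if it is absent."""
    try:
        return items.index(value)
    except ValueError:
        # list.index() raises ValueError when the value is missing
        return default


list_names = ["David", "Pablo", "Oscar"]
print(find_index(list_names, "Pablo"))  # 1
print(find_index(list_names, "Alex"))   # -1
```

Checking `"Alex" in list_names` before calling `index()` is an alternative, at the cost of scanning the list twice.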

<font color="#0056D2" >**Using Dictionaries to Save and Retrieve Data in Python**</font> 

In [None]:
contacts = {"name": "Alfredo", "lastname": "Deza"}
contacts.get("phone","Unknown")


'Unknown'

In [40]:
try:
  contacts['John']
except KeyError:
  print("Peter")

Peter


In [16]:
contacts.keys(), contacts.values()

(dict_keys(['name', 'lastname']), dict_values(['Alfredo', 'Deza']))

In [17]:
contacts["phone"]= "678-600-1111"
print(contacts)

{'name': 'Alfredo', 'lastname': 'Deza', 'phone': '678-600-1111'}
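
The difference between reading a fallback and storing one can be easy to miss. The following sketch uses a separate, hypothetical dictionary named `person` to contrast `get()`, which leaves the dictionary untouched, with `setdefault()`, which also inserts the fallback.

```python
person = {"name": "Alfredo", "lastname": "Deza"}

# get() returns the fallback without modifying the dictionary
print(person.get("phone", "Unknown"))         # Unknown
print("phone" in person)                      # False

# setdefault() returns the fallback and also stores it under the key
print(person.setdefault("phone", "Unknown"))  # Unknown
print(person["phone"])                        # Unknown
```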


<font color="#0056D2" >**Overview of Less Common Data Structures in Python**</font> 

In [24]:
unique = set()
unique.add(4)
unique.add(1)
s = unique.pop()
print(unique, s)

{4} 1
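
Note that `set.pop()` removes an arbitrary element, so the order seen above should not be relied on. A more typical use of a set is removing duplicates and testing membership. The values in this small sketch are made up for illustration.

```python
ratings = [90, 88, 90, 89, 88]   # made-up sample values

unique_ratings = set(ratings)    # duplicates collapse automatically
print(sorted(unique_ratings))    # [88, 89, 90]
print(90 in unique_ratings)      # True
print(len(unique_ratings))       # 3
```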


<font color="#0056D2" >**Iterating over Data Structures in Python**</font> 

In [27]:
contacts = {"Alfredo": "alfredo@example.org", "Kennedy": "kennedy@example.org", "Noah": "noah@example.org"}
for name, email in contacts.items():
    print(name, email)


Alfredo alfredo@example.org
Kennedy kennedy@example.org
Noah noah@example.org


<font color="#0056D2" >**Storing Data Between Data Structures in Python**</font> 



In [32]:
home_items = os.listdir('/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/')
home_content = {"files":[],"directories":[]}

home_paths = [os.path.join('/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/',item) for item in home_items]

for path in home_paths:
    if os.path.isdir(path):
        home_content['directories'].append(path)
    if os.path.isfile(path):
        home_content['files'].append(path)

print(home_content)

{'files': ['/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/README.md', '/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/LICENSE'], 'directories': ['/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/week1', '/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git']}


In [34]:
for item in home_content['files']:
    print(item)

/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/README.md
/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/LICENSE
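
The same classification can be tried on any machine, without the `/workspaces/...` path, by building a throwaway directory first. This is a minimal sketch: the `week1` and `README.md` names are placeholders created only for the demonstration, and it stores item names rather than full paths.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create one sub-directory and one file to classify (placeholder names)
    os.mkdir(os.path.join(root, "week1"))
    with open(os.path.join(root, "README.md"), "w") as f:
        f.write("demo")

    content = {"files": [], "directories": []}
    for item in sorted(os.listdir(root)):
        path = os.path.join(root, item)
        # In this demo every entry is either a directory or a regular file
        key = "directories" if os.path.isdir(path) else "files"
        content[key].append(item)

print(content)  # {'files': ['README.md'], 'directories': ['week1']}
```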


<font color="#0056D2" >**Walking the filesystem, inspecting files**</font> 




In [None]:
# yields the 'current' dir, then the directories, and then any files it finds
# for each level it traverses
for path_info in os.walk('..'):
    print(path_info)
    break
    

('..', ['week1', '.git'], ['README.md', 'LICENSE'])


In [47]:
import os
from os.path import abspath, join


# producing absolute paths, instead of a tuple of three items
for top_dir, directories, files in os.walk('..'):
    for directory in directories:
        print(abspath(join(top_dir, directory)))
    print("\n")
    for _file in files:
        print(abspath(join(top_dir, _file)))
    break

/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/week1
/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git


/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/README.md
/workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/LICENSE


In [49]:
# Now that absolute paths are shown, we can inspect them for file metadata

import os
from os.path import abspath, join, getsize

sizes = {}

for top_dir, directories, files in os.walk('..'):
    for _file in files:
        full_path = abspath(join(top_dir, _file))
        size = getsize(full_path)
        sizes[full_path] = size
        #break

sorted_results = sorted(sizes, key=sizes.get, reverse=True)


for path in sorted_results[:10]:
    print("Path: {0}, size: {1}".format(path, sizes[path]))

Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/LICENSE, size: 35149
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git/objects/pack/pack-d85bdce49633cef446c881a006743750d3371cfb.pack, size: 14347
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/week1/notes_week1.ipynb, size: 11559
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git/hooks/pre-rebase.sample, size: 4898
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git/hooks/fsmonitor-watchman.sample, size: 4726
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git/hooks/update.sample, size: 3650
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/README.md, size: 3501
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git/hooks/push-to-checkout.sample, size: 2783
Path: /workspaces/Scripting-with-Python-and-SQL-for-Data-Engineering/.git/hooks/sendemail-validate.sample, size:
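
The walk-and-sort pattern above can be wrapped in a reusable function. The sketch below assumes a hypothetical name, `largest_files`, and exercises it on a temporary directory so that it runs anywhere; the file names and sizes are invented for the demonstration.

```python
import os
import tempfile
from os.path import basename, getsize, join


def largest_files(root, limit=10):
    """Return (path, size) pairs for the biggest files under root."""
    sizes = {}
    for top_dir, _directories, files in os.walk(root):
        for name in files:
            full_path = join(top_dir, name)
            sizes[full_path] = getsize(full_path)
    # Sort the (path, size) pairs by size, largest first
    return sorted(sizes.items(), key=lambda pair: pair[1], reverse=True)[:limit]


with tempfile.TemporaryDirectory() as root:
    os.mkdir(join(root, "sub"))
    for name, payload in [("small.txt", "a"), (join("sub", "big.txt"), "a" * 100)]:
        with open(join(root, name), "w") as f:
            f.write(payload)
    results = [(basename(path), size) for path, size in largest_files(root)]

print(results)  # [('big.txt', 100), ('small.txt', 1)]
```

Sorting `sizes.items()` keeps each path next to its size, which avoids the second dictionary lookup when printing.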

## <font color="#0056D2" >**Introduction to Data Sources and Formats in Python**</font> 

<font color="#0056D2" >**Loading Data from Files and File Paths in Python**</font> 

Running the cell below reads the entire file at once, so we can then inspect what `sql_contents` holds: a single string with the whole file.

In [16]:
sql_file = open("./data/populate.sql")
sql_contents = sql_file.read()
print(sql_contents)
sql_file.close()

INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Charming Gruner Veltliner 2013", "90.0", "Kamptal, Austria");
INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Charming Gruner Veltliner 2014", "90.0", "Kamptal, Austria");
INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2007", "90.0", "Austria");
INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2010", "88.0", "Austria");
INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2011", "88.0", "Austria");
INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2013", "89.0", "Austria");
INSERT INTO ratings(name, rating, region) VALUES("Lava Cap American River Red", "90.0", "El Dorado, Sierra Foothills, California");
INSERT INTO ratings(name, rating, region) VALUES("Lava Cap Barbera 2010", "90.0", "Sierra Foothills, California");
INSERT INTO ratings(name, rating, region) VALUES("Lava Cap Ba

If you use `readlines()` instead, you get a list with one string per line rather than a single string.

In [17]:
sql_file = open("./data/populate.sql")
sql_contents = sql_file.readlines()
print(sql_contents)
sql_file.close()

['INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Charming Gruner Veltliner 2013", "90.0", "Kamptal, Austria");\n', 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Charming Gruner Veltliner 2014", "90.0", "Kamptal, Austria");\n', 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2007", "90.0", "Austria");\n', 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2010", "88.0", "Austria");\n', 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2011", "88.0", "Austria");\n', 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2013", "89.0", "Austria");\n', 'INSERT INTO ratings(name, rating, region) VALUES("Lava Cap American River Red", "90.0", "El Dorado, Sierra Foothills, California");\n', 'INSERT INTO ratings(name, rating, region) VALUES("Lava Cap Barbera 2010", "90.0", "Sierra Foothills, California");\n', 'INSERT INTO ratings

<div class="alert alert-block alert-info">
<b>Note:</b> When you are done loading the information from a file, make sure you close it, here by calling <code>sql_file.close()</code>. Once the file is closed, there is no longer an open file descriptor for it on the server. Opening the file in a <code>with</code> block closes it automatically when the block ends. </div>
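
A quick way to see this behaviour is to check the file object's `closed` attribute inside and after a `with` block. The sketch below writes a small temporary file first so that it does not depend on `./data/populate.sql`; the two SQL statements in it are placeholders.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere
fd, path = tempfile.mkstemp(suffix=".sql")
with os.fdopen(fd, "w") as tmp:
    tmp.write("SELECT 1;\nSELECT 2;\n")

with open(path) as sql_file:
    lines = sql_file.readlines()
    print(sql_file.closed)   # False: still open inside the block

print(sql_file.closed)       # True: closed automatically on exit
print(lines)                 # ['SELECT 1;\n', 'SELECT 2;\n']

os.remove(path)              # clean up the temporary file
```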



In [18]:
with open("./data/populate.sql" ) as sql_file:
    sql_contents = sql_file.readlines()

In [19]:
sql_contents[:5]

['INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Charming Gruner Veltliner 2013", "90.0", "Kamptal, Austria");\n',
 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Charming Gruner Veltliner 2014", "90.0", "Kamptal, Austria");\n',
 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2007", "90.0", "Austria");\n',
 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2010", "88.0", "Austria");\n',
 'INSERT INTO ratings(name, rating, region) VALUES("Laurenz V Singing Gruner Veltliner 2011", "88.0", "Austria");\n']

In [21]:
df = pd.read_csv("./data/wine-ratings-small.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,name,grape,region,variety,rating,notes
0,0,Laurenz V Charming Gruner Veltliner 2013,,"Kamptal, Austria",White Wine,90.0,Aromas of ripe apples and a typical Veltliner ...
1,1,Laurenz V Charming Gruner Veltliner 2014,,"Kamptal, Austria",White Wine,90.0,Aromas of ripe apples and a typical Veltliner ...
2,2,Laurenz V Singing Gruner Veltliner 2007,,Austria,White Wine,90.0,"A very attractive fruit bouquet yields apple, ..."
3,3,Laurenz V Singing Gruner Veltliner 2010,,Austria,White Wine,88.0,"A very attractive fruit bouquet yields apple, ..."
4,4,Laurenz V Singing Gruner Veltliner 2011,,Austria,White Wine,88.0,"A very attractive fruit bouquet yields apple, ..."


<font color="#0056D2" >**Working with JSON in Python**</font> 

In [24]:
data = {"name":"Alfredo","data": [], "valid":True}
json.dumps(data)

'{"name": "Alfredo", "data": [], "valid": true}'

In [25]:
json_output = '{"name": "Alfredo", "data": 1, "valid": true}'
loaded_json = json.loads(json_output)
type(loaded_json)

dict

In [26]:
with open("./data/wine-ratings.json") as f:
    loaded_json = json.load(f)

In [27]:
loaded_json

{'name': {'0': 'Laurenz V Charming Gruner Veltliner 2013',
  '1': 'Laurenz V Charming Gruner Veltliner 2014',
  '2': 'Laurenz V Singing Gruner Veltliner 2007',
  '3': 'Laurenz V Singing Gruner Veltliner 2010',
  '4': 'Laurenz V Singing Gruner Veltliner 2011',
  '5': 'Laurenz V Singing Gruner Veltliner 2013',
  '6': 'Lava Cap American River Red',
  '7': 'Lava Cap Barbera 2010',
  '8': 'Lava Cap Battonage Chardonnay 2012',
  '9': 'Lava Cap Cabernet Sauvignon 2013',
  '10': 'Lava Cap Cabernet Sauvignon 2016',
  '11': 'Lava Cap Petite Sirah 2013',
  '12': 'Lava Cap Petite Sirah 2014',
  '13': 'Lava Cap Petite Sirah 2016',
  '14': 'Lava Cap Reserve Chardonnay 2015',
  '15': 'Lava Cap Reserve Chardonnay 2018',
  '16': 'Lava Cap Reserve Chardonnay 2016',
  '17': 'Lava Cap Reserve Merlot 2015',
  '18': 'Lava Cap Sauvignon Blanc 2015',
  '19': 'Lava Cap Sauvignon Blanc 2017',
  '20': 'Lava Cap Syrah 2009',
  '21': 'Lava Cap Syrah 2014',
  '22': 'Lava Cap Syrah 2013',
  '23': 'Lava Vine Winery K
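
The keys in the output above are strings such as `'0'` because JSON object keys are always strings. More generally, a round trip through JSON does not always give back the same Python types. The dictionary in this sketch is invented to show the most common conversions.

```python
import json

original = {1: "one", "pair": (1, 2), "valid": True, "missing": None}
restored = json.loads(json.dumps(original))

print(restored)
# {'1': 'one', 'pair': [1, 2], 'valid': True, 'missing': None}
print(restored == original)  # False: the int key and the tuple changed type
```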

<font color="#0056D2" >**Saving Data from Python to Disk**</font> 

In [32]:
data = {"name":"Alfredo","lastname":"Deza","valid":True}

In [33]:
with open("./data/sample_data.json","w") as f:
    json.dump(data,f)
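
To confirm that the data was persisted correctly, it can be read back with `json.load()`. This sketch writes to a temporary directory instead of `./data/` so that it runs without the course repository.

```python
import json
import os
import tempfile

data = {"name": "Alfredo", "lastname": "Deza", "valid": True}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_data.json")
    with open(path, "w") as f:
        json.dump(data, f)          # serialize to the open file
    with open(path) as f:
        restored = json.load(f)     # parse the file back into Python objects

print(restored == data)  # True
```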

In [34]:
data = {"grape": "Cabernet Franc", "species": "Vitis vinifera", "origin": "Bordeaux, France"}
# Convert Python data to JSON. The `.dumps()` method takes a data structure as input and provides a JSON string as output
# mnemonic: dumps -> DUMP to String
json.dumps(data)

'{"grape": "Cabernet Franc", "species": "Vitis vinifera", "origin": "Bordeaux, France"}'

In [35]:
json_data = json.dumps(data)
# Now load it into Python
# mnemonic: loads -> LOAD from String
json.loads(json_data)


{'grape': 'Cabernet Franc',
 'species': 'Vitis vinifera',
 'origin': 'Bordeaux, France'}

In [37]:
collection = [data, data]
print(collection)
# may look similar in the output, but the difference is that JSON is now a string
json.dumps(collection)

[{'grape': 'Cabernet Franc', 'species': 'Vitis vinifera', 'origin': 'Bordeaux, France'}, {'grape': 'Cabernet Franc', 'species': 'Vitis vinifera', 'origin': 'Bordeaux, France'}]


'[{"grape": "Cabernet Franc", "species": "Vitis vinifera", "origin": "Bordeaux, France"}, {"grape": "Cabernet Franc", "species": "Vitis vinifera", "origin": "Bordeaux, France"}]'

In [38]:
# define a nested data structure in a single line
grape_data = {"name": "Cabernet Franc", "regions": [{"country": "France", "sub-regions": ["Bordeaux", "Loire Valley"]},{"country": "Italy", "sub-regions": ["Apulia", "Tuscany"]}, {"country": "Argentina", "sub-regions": ["Mendoza", "Lujan de Cuyo", "Salta"]}]} 
# Serialize the Python dictionary to a JSON string, but using extra formatting options, like sorted keys
# and using 4 spaces for indentation
data_as_json = json.dumps(grape_data, sort_keys=True, indent=4)
print(data_as_json)

{
    "name": "Cabernet Franc",
    "regions": [
        {
            "country": "France",
            "sub-regions": [
                "Bordeaux",
                "Loire Valley"
            ]
        },
        {
            "country": "Italy",
            "sub-regions": [
                "Apulia",
                "Tuscany"
            ]
        },
        {
            "country": "Argentina",
            "sub-regions": [
                "Mendoza",
                "Lujan de Cuyo",
                "Salta"
            ]
        }
    ]
}


# <font color="#0056D2" >**Build a useful Python Decorator**</font> 


In [50]:
# The parent function is going to be the decorator
def parent(func):
    print(f"Function name is: {func.__name__}")
    return func

In [None]:
def main():
    print("This is the main function running!")


In [53]:
lazy_main = parent(main)
lazy_main()

Function name is: main
This is the main function running!


<font color="#0056D2" >**Python Decorator**</font> 


In [58]:
def parent():
    def decorate(func):
        print(f"Function name is: {func.__name__}")
        return func
    return decorate

In [60]:
@parent()
def main():
    print("This is the main function running!")

Function name is: main
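
Both versions above print the name once, at decoration time, because they return `func` unchanged. A decorator that acts on every call usually returns a wrapper function instead. The following is a minimal sketch of that pattern; `log_calls` and `add` are hypothetical names, and `functools.wraps` is used so the wrapper keeps the original function's name.

```python
import functools


def log_calls(func):
    """Print the function name every time the decorated function is called."""
    @functools.wraps(func)  # keep func.__name__ and docstring on the wrapper
    def wrapper(*args, **kwargs):
        print(f"Calling: {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


@log_calls
def add(a, b):
    return a + b


result = add(2, 3)      # prints "Calling: add"
print(result)           # 5
print(add.__name__)     # add
```

Without `functools.wraps`, `add.__name__` would report `wrapper`, which makes logs and tracebacks harder to read.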
