 # Data Wrangling with Python

 ## Advanced Data Structures and File Handling

 Explore Python's advanced data structures and apply them to solve problems, then use OS-level file-handling operations to manipulate data in files.

 ### Advanced Data Structures

 #### Iterator
 An iterator is an object that implements the `__next__` method (together with `__iter__`), which lets it step through a collection (lists, tuples, dicts, etc.) one element at a time. It is a stateful object that tracks which element it will return next. When the collection has no more elements, it raises a `StopIteration` exception.

 Python's `for` loop is built on iterators, which makes it behave more like a `foreach` than the classic three-part `for` loop. The work normally done by the initialization, increment, and termination condition is handled by the iterator.
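To make the protocol concrete, here is a minimal sketch of what a `for` loop does behind the scenes. `Countdown` is a hypothetical class written only for this illustration; it is not part of the standard library.

```python
class Countdown:
    """Hypothetical iterator that yields start, start - 1, ..., 1 and then stops."""

    def __init__(self, start):
        self.current = start  # the state the iterator keeps between calls

    def __iter__(self):
        # An iterator returns itself, so it can be used directly in a for loop.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that no elements remain
        value = self.current
        self.current -= 1
        return value


# Stepping through a list by hand, the way a for loop does internally.
numbers = iter([10, 20])
first = next(numbers)
second = next(numbers)
try:
    next(numbers)
    exhausted = False
except StopIteration:
    exhausted = True

countdown_values = list(Countdown(3))
print(first, second, exhausted, countdown_values)
```

The `for` statement calls `iter()` once, then calls `next()` repeatedly and stops quietly when `StopIteration` is raised.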

 ##### Introduction to the Iterator

In [1]:
# using comprehension to generate list of numbers
big_list_of_numbers = [ 1 for x in range(0, 10000000) ]

In [2]:
# check the size of variable
from sys import getsizeof

getsizeof(big_list_of_numbers)

81528056

In [3]:
from itertools import repeat

# using an iterator instead of a list. Memory usage is minimal because the values are produced on demand rather than stored in memory
small_list_of_numbers = repeat(1, times=10000000)
getsizeof(small_list_of_numbers)

56

In [4]:
for i, x in enumerate(small_list_of_numbers):
    print(x)
    if i > 10:
        break

getsizeof(small_list_of_numbers)

1
1
1
1
1
1
1
1
1
1
1
1


56

In [5]:
# itertools definition
from itertools import (
    permutations, combinations, dropwhile, repeat, zip_longest
)

permutations?

In [6]:
combinations?

In [7]:
dropwhile?

In [8]:
repeat?

In [9]:
zip_longest?
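The `?` lookups above only display each function's docstring in IPython. As a quick illustration of what the five functions return, the following sketch applies them to small made-up inputs (the sample values are assumptions chosen for readability).

```python
from itertools import combinations, dropwhile, permutations, repeat, zip_longest

letters = ['a', 'b', 'c']

# Every ordering of 2 items (order matters).
perms = list(permutations(letters, 2))

# Every selection of 2 items (order does not matter).
combs = list(combinations(letters, 2))

# Skip items while the predicate holds, then yield everything that remains.
after_small = list(dropwhile(lambda n: n < 3, [1, 2, 3, 1, 5]))

# Yield the same value a fixed number of times.
threes = list(repeat('x', 3))

# Pair items up, padding the shorter input with fillvalue.
padded = list(zip_longest([1, 2, 3], ['one'], fillvalue=None))

print(perms)
print(combs)
print(after_small)
print(threes)
print(padded)
```

Each call returns a lazy iterator; `list()` is used here only to materialize the results for display.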

 #### Stacks

 A list with a Last In, First Out (LIFO) restriction: the element added last is the first one removed when a value is read.

 ##### Implementing a stack in Python

In [10]:
stack = []

# use append to add element to end of stack (list)
stack.append(25)
stack

[25]

In [11]:
# append another value
stack.append(-12)
stack

[25, -12]

In [12]:
# Read the value at top of stack using pop
tos = stack.pop()
tos

-12

In [13]:
# current stack value
stack

[25]

In [14]:
# append another value
stack.append('Hello')
stack

[25, 'Hello']

 ##### Implementing a stack using user-defined methods

In [15]:
def stack_push(s, value):
    return s + [value]  # wrap value in a list and concatenate; this returns a new list, so the caller must reassign

def stack_pop(s):
    tos = s[-1]
    del s[-1]
    return tos

url_stack = []

In [16]:
wikipedia_datascience = 'Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge [https://en.wikipedia.org/wiki/Knowledge] and insights from data [https://en.wikipedia.org/wiki/Data] in various forms, both structured and unstructured,similar to data mining [https://en.wikipedia.org/wiki/Data_mining]'

In [17]:
len(wikipedia_datascience)

347

In [18]:
# split string into list
wd_list = wikipedia_datascience.split()
len(wd_list)

34

In [19]:
# push every URL found onto url_stack; each URL is sliced to remove the surrounding []
for word in wd_list:
    if word.startswith('[https://'):
        url_stack = stack_push(url_stack, word[1:-1])
        
url_stack

['https://en.wikipedia.org/wiki/Knowledge',
 'https://en.wikipedia.org/wiki/Data',
 'https://en.wikipedia.org/wiki/Data_mining']

In [20]:
for i in range(0, len(url_stack)):
    print(stack_pop(url_stack))

https://en.wikipedia.org/wiki/Data_mining
https://en.wikipedia.org/wiki/Data
https://en.wikipedia.org/wiki/Knowledge


In [21]:
print(url_stack)

[]


#### Lambda

##### Lambda Expression
Basic lambda example expressing the trigonometric identity:

$$\sin^2\phi + \cos^2\phi = 1$$

In [22]:
import math

def my_sine():
    return lambda x: math.sin(math.radians(x)) # return is assumed in lambda function

def my_cosine():
    return lambda x: math.cos(math.radians(x)) # return is assumed in lambda function

sine = my_sine()
cosine = my_cosine()
math.pow(sine(30), 2) + math.pow(cosine(30), 2)

1.0

 ##### Lambda Expression for Sorting

In [23]:
capitals = [
    ('USA', 'Washington')
    , ('India', 'Delhi')
    , ('France', 'Paris')
    , ('UK', 'London')
]

capitals

[('USA', 'Washington'),
 ('India', 'Delhi'),
 ('France', 'Paris'),
 ('UK', 'London')]

In [24]:
capitals.sort(key=lambda item: item[1])

capitals

[('India', 'Delhi'),
 ('UK', 'London'),
 ('France', 'Paris'),
 ('USA', 'Washington')]
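As a further sketch of key functions, the same data can be ordered in other ways without changing the original list. This example uses the built-in `sorted`, which returns a new list, whereas `list.sort` works in place; the variable names `by_country` and `by_length_desc` are made up for illustration.

```python
capitals = [
    ('USA', 'Washington'),
    ('India', 'Delhi'),
    ('France', 'Paris'),
    ('UK', 'London'),
]

# Sort by country name (element 0 of each tuple).
by_country = sorted(capitals, key=lambda item: item[0])

# Sort by the length of the capital's name, longest first.
# The sort is stable, so equal-length names keep their original order.
by_length_desc = sorted(capitals, key=lambda item: len(item[1]), reverse=True)

print(by_country)
print(by_length_desc)
```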

##### Multi-Element Membership Checking

In [25]:
list_of_words = [
    'Hello', 'there.', 'How', 'are', 'you', 'doing?'
]

check_for = [ 'How', 'are' ]

all(w in list_of_words for w in check_for)

True

In [26]:
def brute_all(check_list, word_list):
    for w in check_list:
        if w in word_list:
            continue
        else:
            return False
        
    return True

brute_all(check_for, list_of_words)

True
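An alternative sketch of the same check uses sets. Converting the word list to a set once makes each lookup a hash lookup instead of a scan of the list, which may matter when both lists are long. The names `word_set`, `all_present`, and `missing` are illustrative.

```python
list_of_words = ['Hello', 'there.', 'How', 'are', 'you', 'doing?']
check_for = ['How', 'are']

# Build the set once; membership tests against it do not scan the list.
word_set = set(list_of_words)

# True only if every element of check_for is in word_set.
all_present = set(check_for).issubset(word_set)

# Report which words are absent, using a second made-up query list.
missing = [w for w in ['How', 'is'] if w not in word_set]

print(all_present, missing)
```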

#### Queue
A queue is similar in behavior to a stack, except that the restriction on the list is FIFO (First In, First Out).

##### Implementing a Queue

In [27]:
%%time
# Create the queue
queue = []

for i in range(0, 100000):
    queue.append(i)
    
print('Queue created')

Queue created
CPU times: user 15.8 ms, sys: 6.37 ms, total: 22.1 ms
Wall time: 33.3 ms


In [28]:
%%time
# Pop every element from the front of the queue. This takes longer than creating the queue because each pop(0) makes Python shift all remaining elements one position to the left
for i in range(0, 100000):
    queue.pop(0)
    
print('Queue emptied')

Queue emptied
CPU times: user 991 ms, sys: 4.52 ms, total: 996 ms
Wall time: 1.02 s


In [30]:
%%time
# deque is a double-ended queue optimized for appends and pops at both ends
from collections import deque

queue2 = deque()

for i in range(0, 100000):
    queue2.append(i)

print('Queue created')
    
for i in range(0, 100000):
    queue2.popleft()

print('Queue emptied')

Queue created
Queue emptied
CPU times: user 21.5 ms, sys: 1.06 ms, total: 22.6 ms
Wall time: 22.1 ms
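The timings above show the cost difference but not the ordering itself. The following small sketch, with made-up task names, contrasts FIFO removal from a `deque` with LIFO removal from a list used as a stack.

```python
from collections import deque

# Queue: items leave in the order they arrived (FIFO).
tasks = deque()
for name in ['first', 'second', 'third']:
    tasks.append(name)
served = [tasks.popleft() for _ in range(len(tasks))]

# Stack: the most recently added item leaves first (LIFO).
stack = ['first', 'second', 'third']
unwound = [stack.pop() for _ in range(len(stack))]

print(served)
print(unwound)
```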


##### Activity 3: Permutation, Iterator, Lambda, List

In [2]:
from itertools import dropwhile, permutations

# create tuple permutations
for permutation in permutations(range(3)):
    print(permutation)
    assert(isinstance(permutation, tuple))

(0, 1, 2)
(0, 2, 1)
(1, 0, 2)
(1, 2, 0)
(2, 0, 1)
(2, 1, 0)


In [8]:
# Remove leading zeros
for permutation in permutations(range(3)):
    permutation = list(dropwhile(lambda a: a <= 0, permutation))
    print(permutation)

[1, 2]
[2, 1]
[1, 0, 2]
[1, 2, 0]
[2, 0, 1]
[2, 1, 0]


In [11]:
import math

def convert_to_number(number_stack):
    result = 0
    for i in range(0, len(number_stack)):
        result += (number_stack.pop() * math.pow(10, i))
        
    return result

# Remove leading zeros
for permutation in permutations(range(3)):
    number_stack = list(dropwhile(lambda a: a <= 0, permutation))
    print(convert_to_number(number_stack))

12.0
21.0
102.0
120.0
201.0
210.0


### Operating System (OS)-Level Functions

OS-level functions let Python call the underlying operating system's API to perform tasks an application needs.

#### Manipulating Environment variables

In [15]:
import os

# Setting, reading system environment variables
os.environ['MY_KEY'] = 'MY_VALUE'
os.getenv('MY_KEY')

'MY_VALUE'

In [17]:
# Reading an environment variable that is not set returns None
print(os.getenv('MY_INVALID_KEY'))

# get all environment variables
os.environ

None


environ{'TERM_PROGRAM': 'iTerm.app',
        'NVM_CD_FLAGS': '',
        'TERM': 'xterm-color',
        'SHELL': '/bin/bash',
        'TMPDIR': '/var/folders/xx/sd0_fz6n7ylbhdz47b0ttjpr0000gn/T/',
        'CONDA_SHLVL': '2',
        'Apple_PubSub_Socket_Render': '/private/tmp/com.apple.launchd.K5bGSSqTjF/Render',
        'CONDA_PROMPT_MODIFIER': '(/Volumes/apfs_256gb/Personal/Technology/conda/data-engineering) ',
        'TERM_PROGRAM_VERSION': '3.2.9',
        'OLDPWD': '/Users/darrens',
        'TERM_SESSION_ID': 'w0t0p0:737E6163-6369-4B7A-9EFA-56CDF1834385',
        'NVM_DIR': '/Users/darrens/.nvm',
        'USER': 'darrens',
        'APFS_256GB_CONDA': '/Volumes/apfs_256gb/Personal/Technology/conda',
        'CONDA_EXE': '/miniconda3/bin/conda',
        'SSH_AUTH_SOCK': '/private/tmp/com.apple.launchd.tqCKBsVBYo/Listeners',
        '__CF_USER_TEXT_ENCODING': '0x0:0:0',
        '_CE_CONDA': '',
        'CONDA_PREFIX_1': '/miniconda3',
        'APFS_256GB_SOURCECONTROL': '/Volumes/ap

In [19]:
# Remove environment variable
os.environ.pop('MY_KEY')

'MY_VALUE'
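Since `os.environ` behaves like a dictionary, missing keys can be handled with defaults rather than exceptions. The sketch below assumes a hypothetical variable name, `DEMO_KEY`, used only for this illustration.

```python
import os

# DEMO_KEY is a hypothetical name; environment values must be strings.
os.environ['DEMO_KEY'] = '42'
present = os.environ.get('DEMO_KEY', 'fallback')

# pop with a default never raises, even if the key is already gone.
os.environ.pop('DEMO_KEY', None)
absent = os.environ.get('DEMO_KEY', 'fallback')
removed_again = os.environ.pop('DEMO_KEY', None)

# Plain indexing, by contrast, raises KeyError for a missing variable.
try:
    os.environ['DEMO_KEY']
    raised = False
except KeyError:
    raised = True

print(present, absent, removed_again, raised)
```

Changes made through `os.environ` affect the current process and any child processes it starts, not the parent shell.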

#### Basic File Operations in Python

##### File Operations

Explore file manipulation functions
1. Read file line by line
2. Read file as a whole

##### File Handling

A file can be opened in Python in different modes, which control what operations an application can perform on the file.

|Character|Meaning|
|:-------:|:------|
|'r'|Open for reading (default)|
|'w'|Open for writing, truncating the file first|
|'x'|Create a new file and open it for writing; fails if the file already exists|
|'a'|Open for writing, appending to the end of the file if it exists|
|'b'|Binary mode|
|'t'|Text mode (default)|
|'+'|Open for updating (reading and writing)|

> Note
> - `b` binary mode: no decoding is performed and `bytes` objects are returned
> - `t` text mode: the bytes are decoded and `str` objects are returned
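The modes in the table can be tried out safely in a temporary directory. This is a minimal sketch; the file name `modes_demo.txt` and the variable names are assumptions made for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    # 'x' creates the file and fails if it already exists.
    with open(path, 'x', encoding='utf-8') as fd:
        fd.write('first\n')
    try:
        open(path, 'x').close()
        x_failed = False
    except FileExistsError:
        x_failed = True

    # 'a' keeps the existing content and writes at the end.
    with open(path, 'a', encoding='utf-8') as fd:
        fd.write('second\n')

    # 't' returns str, 'b' returns bytes.
    with open(path, 'rt', encoding='utf-8') as fd:
        as_text = fd.read()
    with open(path, 'rb') as fd:
        as_bytes = fd.read()

    # 'w' truncates the file before writing.
    with open(path, 'w', encoding='utf-8') as fd:
        fd.write('replaced\n')
    with open(path, encoding='utf-8') as fd:
        after_w = fd.read()

print(x_failed, repr(as_text), type(as_bytes).__name__, repr(after_w))
```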

###### Opening and closing a file manually

In [24]:
# open and closing manually
fd = open('alices-adventures-in-wonderland-by-lewis-carroll.txt', 'rb')
print(fd)
print(f'fd.closed pre fd.close(): {fd.closed}')
fd.close()
print(f'fd.closed post fd.close(): {fd.closed}')

<_io.BufferedReader name='alices-adventures-in-wonderland-by-lewis-carroll.txt'>
fd.closed pre fd.close(): False
fd.closed post fd.close(): True


###### Opening and closing a file using context manager

Python provides the compound statement `with`, which runs a code block inside a context manager. When the block exits, the context manager automatically cleans up resources, for example by closing the file handle.

In [23]:
# open and closing using context manager
with open('alices-adventures-in-wonderland-by-lewis-carroll.txt', 'rb') as fd:
    print(fd)
    print(f'fd.closed in context manager: {fd.closed}')
    
print(f'fd.closed outside context manager: {fd.closed}')

<_io.BufferedReader name='alices-adventures-in-wonderland-by-lewis-carroll.txt'>
fd.closed in context manager: False
fd.closed outside context manager: True
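One reason to prefer `with` is that the file is closed even when the block raises an exception. The sketch below simulates a failure and then shows the roughly equivalent manual `try`/`finally` form; the file name and the `ValueError` are assumptions made for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'context_demo.txt')  # hypothetical file name

    # The exception leaves the with block, but the file is still closed.
    try:
        with open(path, 'w', encoding='utf-8') as fd:
            fd.write('partial')
            raise ValueError('simulated failure')
    except ValueError:
        pass
    closed_after_error = fd.closed

    # Roughly what the with statement does for a file, written by hand.
    manual = open(path, encoding='utf-8')
    try:
        content = manual.read()
    finally:
        manual.close()
    closed_manually = manual.closed

print(closed_after_error, content, closed_manually)
```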


###### Reading a file line by line

Iterating over the file object reads one line at a time, so Python can process large files without loading the whole file into memory.

In [27]:
with open('alices-adventures-in-wonderland-by-lewis-carroll.txt', encoding='utf8') as fd:
    for line in fd:
        print(line)  

﻿Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll



This eBook is for the use of anyone anywhere at no cost and with

almost no restrictions whatsoever.  You may copy it, give it away or

re-use it under the terms of the Project Gutenberg License included

with this eBook or online at www.gutenberg.org





Title: Alice’s Adventures in Wonderland



Author: Lewis Carroll



Posting Date: June 25, 2008 [EBook #11]

Release Date: March, 1994

Last Updated: October 6, 2016



Language: English



Character set encoding: UTF-8



*** START OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***





















ALICE’S ADVENTURES IN WONDERLAND



Lewis Carroll



THE MILLENNIUM FULCRUM EDITION 3.0









CHAPTER I. Down the Rabbit-Hole



Alice was beginning to get very tired of sitting by her sister on the

bank, and of having nothing to do: once or twice she had peeped into the

book her sister was reading, but it had no pictures or conversations


The Gryphon lifted up both its paws in surprise. ‘What! Never heard of

uglifying!’ it exclaimed. ‘You know what to beautify is, I suppose?’



‘Yes,’ said Alice doubtfully: ‘it means--to--make--anything--prettier.’



‘Well, then,’ the Gryphon went on, ‘if you don’t know what to uglify is,

you ARE a simpleton.’



Alice did not feel encouraged to ask any more questions about it, so she

turned to the Mock Turtle, and said ‘What else had you to learn?’



‘Well, there was Mystery,’ the Mock Turtle replied, counting off

the subjects on his flappers, ‘--Mystery, ancient and modern, with

Seaography: then Drawling--the Drawling-master was an old conger-eel,

that used to come once a week: HE taught us Drawling, Stretching, and

Fainting in Coils.’



‘What was THAT like?’ said Alice.



‘Well, I can’t show it you myself,’ the Mock Turtle said: ‘I’m too

stiff. And the Gryphon never learnt it.’



‘Hadn’t time,’ said the Gryphon: ‘I went to the Classics master, though.

He was an old cr
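Line-by-line iteration is usually combined with some running computation rather than printing. As a minimal sketch, the hypothetical helper `count_lines_and_words` below keeps only two counters while it streams through a small sample file created for the example.

```python
import os
import tempfile


def count_lines_and_words(path):
    """Stream through a text file and return (line_count, word_count)."""
    line_count = 0
    word_count = 0
    with open(path, encoding='utf-8') as fd:
        for line in fd:  # the file object yields one line per iteration
            line_count += 1
            word_count += len(line.split())
    return line_count, word_count


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file
    with open(sample_path, 'w', encoding='utf-8') as fd:
        fd.write('Alice was beginning\nto get very tired\n\nof sitting\n')
    totals = count_lines_and_words(sample_path)

print(totals)
```

Each line keeps its trailing newline, which is why `print(line)` produces a blank line after every line of output; `print(line, end='')` avoids the extra spacing.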

###### Writing to file

In [43]:
data_dict = {
    'India': 'Delhi',
    'France': 'Paris',
    'UK': 'London',
    'USA': 'Washington'
}

# write the file
with open('data_temporary_files.txt', 'w') as fd:
    for country, capital in data_dict.items():
        fd.write(f'The capital of {country} is {capital}\n')
        
# read the file
with open('data_temporary_files.txt', 'r') as fd:
    for line in fd:
        print(line)

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington



###### Appending to file

In [44]:
data_dict2 = {
    'China': 'Beijing',
    'Japan': 'Tokyo'
}

# append to file
with open('data_temporary_files.txt', 'a') as fd:
    for country, capital in data_dict2.items():
        print(f'The capital of {country} is {capital}', file=fd) # print adds a \n automatically

# read the file
with open('data_temporary_files.txt', 'r') as fd:
    for line in fd:
        print(line)        

The capital of India is Delhi

The capital of France is Paris

The capital of UK is London

The capital of USA is Washington

The capital of China is Beijing

The capital of Japan is Tokyo



##### Activity 4: Design CSV Parser

In [58]:
from itertools import zip_longest

def convert_to_json(header, line):
    return {
       item[0]: item[1] for item in zip_longest(header, line, fillvalue=None)
    }

with open('sales_record.csv', 'rt') as fd:
    header = fd.readline().replace("\n", "").split(',')
    result = [convert_to_json(header, line.replace("\n", "").split(',')) for line in fd]
result

[{'Region': 'Central America and the Caribbean',
  'Country': 'Antigua and Barbuda ',
  'Item Type': 'Baby Food',
  'Sales Channel': 'Online',
  'Order Priority': 'M',
  'Order Date': '12/20/2013',
  'Order ID': '957081544',
  'Ship Date': '1/11/2014',
  'Units Sold': '552',
  'Unit Price': '255.28',
  'Unit Cost': '159.42',
  'Total Revenue': '140914.56',
  'Total Cost': '87999.84',
  'Total Profit': '52914.72'},
 {'Region': 'Central America and the Caribbean',
  'Country': 'Panama',
  'Item Type': 'Snacks',
  'Sales Channel': 'Offline',
  'Order Priority': 'C',
  'Order Date': '7/5/2010',
  'Order ID': '301644504',
  'Ship Date': '7/26/2010',
  'Units Sold': '2167',
  'Unit Price': '152.58',
  'Unit Cost': '97.44',
  'Total Revenue': '330640.86',
  'Total Cost': '211152.48',
  'Total Profit': '119488.38'},
 {'Region': 'Europe',
  'Country': 'Czech Republic',
  'Item Type': 'Beverages',
  'Sales Channel': 'Offline',
  'Order Priority': 'C',
  'Order Date': '9/12/2011',
  'Order ID': '