# Practical 3A: Python packages and data 

Upon completion of this session you should be able to:

   - use Python Packages to manipulate the data and files

---
- Materials in this module include resources collected from various open-source online repositories.
- Jupyter source file can be downloaded from clouddeakin SIT384 > weekly resources or https://github.com/gaoshangdeakin/SIT384-Jupyter
- If you find any issue or bug in this document, please submit an issue at [https://github.com/gaoshangdeakin/SIT384/issues](https://github.com/gaoshangdeakin/SIT384/issues)




## Content



### Part 1 Python packages

1.1 [Standard Library](#standlib)

1.2 [Third Party Packages](#3rdparty)

1.3 [Importing a module](#importmod) 


### Part 2 Python Simple IO

2.1 [Input](#input)

2.2 [Output](#output)


### Part 3 Datetime Module

3.1 [Time](#time)

3.2 [Date](#date)

3.3 [Timedelta](#timedelta)

3.4 [Formatting and Parsing](#parsing)

### Part 4 Numpy Module

4.1 [Importing Numpy](#importnp)

4.2 [Numpy arrays](#nparray)

4.3 [Manipulating arrays](#maninp)

4.4 [Array Operations](#arrayop)

4.5 [np.random](#random)


### Part 5 Data Loading

5.1 [TXT](#txt)

5.2 [CSV](#csv)

5.3 [JSON](#json)



---
## <span style="color:#0b486b">1. Python packages</span>

After completing the previous Python sessions, you should know the syntax and semantics of the Python language. Beyond that, you should also learn about Python's libraries and packages to be able to code efficiently. Python’s standard library is very extensive, offering a wide range of facilities as indicated [here](https://docs.python.org/3.7/library/). The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. See the [Python Standard Library Manual](https://docs.python.org/3.7/library/) to read more.

In addition to the standard library, there is a growing collection of several thousand components (from individual programs and modules to packages and entire application development frameworks), available from the [Python Package Index](https://pypi.python.org/pypi).

<a id = "standlib"></a>

### <span style="color:#0b486b">1.1 Standard libraries</span>

For a complete list of the Python standard library modules and their documentation, see the [Python Manual](https://docs.python.org/3.7/library/). A few worth mentioning are:

* ``math`` for numeric and math-related functions and data types
* ``urllib`` for fetching data across the web
* ``datetime`` for manipulating dates and times
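* ``pickle`` for serializing and deserializing data structures, enabling us to save our variables on the disk and load them from the disk (the separate ``cPickle`` module existed only in Python 2; in Python 3, ``pickle`` uses the C implementation automatically)
* ``os`` for operating-system-dependent functions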

<a id = "3rdparty"></a>

### <span style="color:#0b486b">1.2 Third party packages</span>

There are thousands of third party packages, each developed for a special task. Some of the useful libraries for data science are:

* ``numpy`` is probably the most fundamental package for efficient scientific computing in Python
* ``scipy`` is one of the core packages for scientific computation
* ``pandas`` is a library for working with table-like data structures called DataFrames
* ``matplotlib`` is a comprehensive plotting library
* ``BeautifulSoup`` is an HTML and XML parser
* ``scikit-learn`` is the most general machine learning library for Python
* ``nltk`` is a toolkit for natural language processing

---
<a id = "importmod"></a>
### <span style="color:#0b486b">1.3 Importing a module</span>

To use a module, first you have to ``import`` it. There are different ways to import a module:

* `import my_module`
* `from my_module import my_function`
* `from my_module import my_function as func`
* `from my_module import submodule`
* `from my_module import submodule as sub`
* `from my_module import *`

**`'import my_module'`** imports the module `'my_module'` and creates a reference to it in the namespace. For example `'import math'` imports the module `'math'` into the namespace. After importing the module this way, you can use the dot operator `(.)` to refer to the objects defined in the module. For example `'math.exp()'` refers to function `'exp()'` in module `'math'`.

In [None]:
import math

x = 2
y1 = math.exp(x)
y2 = math.log(x)

print("e^{} is {} and log({}) is {}".format(x, y1, x, y2))

**`'from my_module import my_function'`** only imports the function `'my_function'` from the module `'my_module'` into the namespace. This way you have access to neither the module itself (since you have not imported it) nor the other objects of the module. You only have access to the object you have imported.

You can use a comma to import multiple objects.

In [None]:
from math import exp

x = 2
y = exp(x)  # no need to write math.exp()

print("e^{} is {}".format(x, y))

**`'from my_module import my_function as func'`** imports the function `'my_function'` from module `'my_module'` but its identifier in the namespace is changed into `'func'`. This syntax is used to import submodules of a module as well. For example later you will see that nowadays it is almost a convention to import matplotlib.pyplot as plt.

In [None]:
# you can change the name of the imported object
from math import exp as myfun

x = 2
y = myfun(x)

print("e^{} is {}".format(x, y))
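
The submodule forms from the list above (`from my_module import submodule as sub`) work the same way. A minimal sketch using the standard library's `os.path` submodule; the alias `osp` and the example path are illustrative choices only:

```python
# Import the 'path' submodule of the 'os' package under a shorter alias.
from os import path as osp

# The alias is used exactly like the original name would be.
joined = osp.join("data", "txt_data1.txt")
name = osp.basename(joined)
stem, ext = osp.splitext(name)

print(joined, name, stem, ext)
```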

**`'from my_module import *'`** imports all the public objects defined in `'my_module'` into the namespace. Therefore after this statement you can simply use the plain name of the object to refer to it and there is no need to use the dot operator:

In [None]:
from math import *

x = 2
y1 = exp(x)
y2 = log(x)

print("e^{} is {} and log({}) is {}".format(x, y1, x, y2))


## <span style="color:#0b486b">2. Python simple input/output</span>

Python uses `input()` and `print()` for basic input and output; both were introduced in the previous two pracs.

<a id = "input"></a>

### <span style="color:#0b486b">2.1 Input</span>

`input()` asks the user for a string of data (ended with a newline), and simply returns the string.
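
Because `input()` always returns a string, numeric input has to be converted before it can be used in arithmetic. A small sketch; the variable `raw` stands in for what a user might type, so the cell runs without waiting for the keyboard:

```python
# In an interactive session this line would be: raw = input("Enter your age: ")
raw = "21"          # assumed example of what the user typed

print(type(raw))    # input() hands back a str, even for digits
age = int(raw)      # convert explicitly before doing arithmetic
next_year = age + 1
print("Next year you will be", next_year)
```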

<a id = "output"></a>

### <span style="color:#0b486b">2.2 Output</span>

`print()` is the basic way to do output. To print multiple things on the same line separated by spaces, use commas between them.

Objects can be printed on the same line using the `end` argument. See the [print()](https://docs.python.org/3/library/functions.html#print) documentation for the full syntax.

In [None]:
print("Sample using end=','")
for i in range(10):
    print(i, end=',')
print("\nSample using end=' '")
for i in range(10):
    print(i, end=' ')
print("\nSample without the end argument")
for i in range(10):
    print(i)

---
## <span style="color:#0b486b">3. datetime module</span>


The datetime module includes functions and classes for date and time parsing, formatting, and arithmetic.

<a id = "time"></a>

### <span style="color:#0b486b">3.1 Time</span>

Time values are represented with the time class. Times have attributes for hour, minute, second, and microsecond. They can also include time zone information.

In [None]:
import datetime

t = datetime.time(11, 21, 33)
print(t)
print('hour  :', t.hour)
print('minute:', t.minute)
print('second:', t.second)
print('microsecond:', t.microsecond)
print('tzinfo:', t.tzinfo)

<a id = "date"></a>

### <span style="color:#0b486b">3.2 Date</span>

Calendar date values are represented with the date class. Instances have attributes for year, month, and day.

In [None]:
import datetime

today = datetime.date.today()
print(today)
print('ctime:', today.ctime())
print('tuple:', today.timetuple())
print('ordinal:', today.toordinal())
print('Year:', today.year)
print('Mon :', today.month)
print('Day :', today.day)

A way to create new date instances is using the `replace()` method of an existing date. For example, you can change the year, leaving the day and month alone.

In [None]:
import datetime

d1 = datetime.date(2013, 3, 12)
print('d1:', d1)

d2 = d1.replace(year=2015)
print('d2:', d2)

<a id = "timedelta"></a>

### <span style="color:#0b486b">3.3 timedelta</span>
Using `replace()` is not the only way to calculate future/past dates. You can use datetime to perform basic arithmetic on date values via the timedelta class. 

In [None]:
today = datetime.datetime.today()
print(today)

tomorrow = today + datetime.timedelta(days=1)  
print(tomorrow)
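
Arithmetic also works in the other direction: subtracting two dates gives a `timedelta`. A minimal sketch with two arbitrarily chosen dates:

```python
import datetime

d1 = datetime.date(2013, 3, 12)
d2 = datetime.date(2015, 3, 12)

# Subtracting two dates returns a timedelta
gap = d2 - d1
print(type(gap), gap.days)

# A timedelta can be built from other units, such as weeks
one_week_earlier = d1 - datetime.timedelta(weeks=1)
print(one_week_earlier)
```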

<a id = "parsing"></a>

### <span style="color:#0b486b">3.4 Formatting and Parsing</span>

The default string representation of a datetime object follows ISO 8601 but uses a space instead of `T` between the date and the time (YYYY-MM-DD HH:MM:SS.mmmmmm); `isoformat()` returns the strict form with the `T` separator. Alternate formats can be generated using `strftime()`. Similarly, if your input data includes timestamp values parsable with `time.strptime()`, then `datetime.strptime()` is a convenient way to convert them to datetime instances.

In [None]:
today = datetime.datetime.today()
print('ISO     :', today)

string from datetime object

In [None]:
str_format = "%a %b %d %H:%M:%S %Y"
s = today.strftime(str_format)
print('strftime:', s)

datetime object from string

In [None]:
print(s)

d = datetime.datetime.strptime(s, str_format)
print(d)
print('strptime:', d.strftime(str_format))

In [None]:
# Define a string variable "s" whose value is 07/03/2017
s = "07/03/2017"
# Define the string format: month/day/year, so this is read as 3 July 2017
str_format = "%m/%d/%Y"


d = datetime.datetime.strptime(s, str_format)
print(d)
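
The format string decides how text and dates map onto each other, and `str()` and `isoformat()` differ only in the separator they use. A short sketch with an arbitrarily chosen timestamp:

```python
import datetime

d = datetime.datetime(2017, 7, 3, 14, 5, 9)

text_default = str(d)                 # space between date and time
text_iso = d.isoformat()              # 'T' between date and time
text_custom = d.strftime("%d/%m/%Y")  # day first

# Parsing the custom text with the same format recovers the date part
parsed = datetime.datetime.strptime(text_custom, "%d/%m/%Y")
print(text_default, text_iso, text_custom, parsed.date())
```

Parsing the same text with `"%m/%d/%Y"` instead would give 7 March 2017, so the format string should always match how the data was written.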

---
## <span style="color:#0b486b">4. Numpy module</span>


Python lists are very flexible for storing any sequence of Python objects. But flexibility usually comes at the price of performance, and therefore Python lists are not ideal for numerical calculations where we are interested in performance. This is where **NumPy** comes in. It adds to Python support for large, multi-dimensional arrays and matrices, along with high-level mathematical functions that operate on these arrays.

Relying on `'BLAS'` and `'LAPACK'`, `'NumPy'` gives Python functionality comparable with `'MATLAB'`. NumPy facilitates advanced mathematical and other types of operations on large amounts of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences. It has become one of the fundamental packages used for numerical computations.

This tutorial reviews only the basics; to learn more about NumPy, visit the [NumPy User Guide](http://docs.scipy.org/doc/numpy/user/index.html).

<a id = "importnp"></a>

### <span style="color:#0b486b">4.1 Importing Numpy</span>

First we have to import a package to be able to use it. NumPy is imported with:

In [None]:
import numpy

By convention, it is imported with an alias:

In [None]:
import numpy as np

<a id = "nparray"></a>

### <span style="color:#0b486b">4.2 Numpy arrays</span>

The core of NumPy is its arrays. You can create an array from a Python list or tuple using the `'array'` function. Arrays work similarly to lists apart from the fact that:

* you can easily perform element-wise operations on them, and
* unlike lists, they should be pre-allocated.

The first point is further explained in the following array operations section. The second point means that there is no in-place equivalent of list append for arrays (`np.append` returns a new array rather than growing the existing one). The size of an array is fixed at the time it is defined.
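
A minimal sketch of what pre-allocation looks like in practice, assuming NumPy is imported as `np`; the variable names are illustrative:

```python
import numpy as np

n = 5
squares = np.zeros(n, dtype=int)   # allocate the full array up front
for i in range(n):
    squares[i] = i * i             # fill existing slots by index

# np.append does not grow 'squares' in place; it builds a new, longer array
longer = np.append(squares, 25)
print(squares, longer)
```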

#### <span style="color:#0b486b">4.2.1 create an array from a list</span>

In [None]:
x = [1, 7, 3, 4, 0, -5]

In [None]:
y = np.array(x)
type(y)

#### <span style="color:#0b486b">4.2.2 create an array using a range</span>

In [None]:
range(5)

In [None]:
print(np.array(range(5)))

In [None]:
print(np.arange(2, 3, 0.2))  #Why is there no value 3.0 in the output?

In [None]:
print(np.linspace(2, 3, 5))    # returns numbers spaced evenly on a linear scale, both endpoints are included


Try replacing the value 5 with 1, 2, 4 or 10. What pattern can you find? Could you guess what **linspace** does without the given comment?
Then try the same approach to work out what **logspace** does.


In [None]:
print(np.logspace(2, 3, 5))   # returns numbers spaced evenly on a log scale
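
The three range-style constructors can be compared side by side. The sketch below uses a step of 0.25 (an arbitrary choice that is exact in floating point) and checks one way of relating `logspace` to `linspace`:

```python
import numpy as np

a = np.arange(2, 3, 0.25)      # stop value 3 is excluded
lin = np.linspace(2, 3, 5)     # stop value 3 is included; 5 points in total
log = np.logspace(2, 3, 5)     # from 10**2 to 10**3, evenly spaced exponents

print(a)
print(lin)
print(log)

# logspace matches 10 raised to the corresponding linspace values
print(np.allclose(log, 10 ** lin))
```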

**Note:** If you need any help on how to use a function or what it does, you can use IPython help. Just add a question mark (?) at the end of the function name and execute the cell:

In [None]:
np.logspace?

#### <span style="color:#0b486b">4.2.3 create a prefilled array</span>

In [None]:
print(np.zeros(5))

In [None]:
print("The 1st sample is", np.ones(5, dtype=int))   # you can specify the data type, default is float
print("The 2nd sample is", np.ones(5, dtype=float))
print("The 3rd sample is", np.ones((5, 5), dtype=int))
print("The 4th sample is", np.ones((5, 5, 5), dtype=int))

In [None]:
np.ones?
# Use this command to see what np.ones does

#### <span style="color:#0b486b">4.2.4 `'mgrid'`</span>
`mgrid` is similar to `meshgrid` in MATLAB:

In [None]:
x, y = np.mgrid[0:5, 0:3]

print(x)
print(y)

In [None]:
np.mgrid?

The following picture might give a better example of how meshgrid works. Details are available [here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html).

<img src="./images/p03/meshgrid.jpg" width="60%" height="60%">
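
NumPy also has its own `np.meshgrid` function. A short sketch of how it relates to `np.mgrid`, assuming matrix-style (`indexing="ij"`) ordering is requested:

```python
import numpy as np

x, y = np.mgrid[0:5, 0:3]

# meshgrid with 'ij' indexing produces the same two coordinate arrays
xm, ym = np.meshgrid(np.arange(5), np.arange(3), indexing="ij")

print(x.shape, y.shape)
print(np.array_equal(x, xm), np.array_equal(y, ym))
```

With the default `indexing="xy"`, `np.meshgrid` returns arrays of shape (3, 5) instead, which is the layout usually wanted for plotting.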


#### <span style="color:#0b486b">4.2.5 array attributes</span>

NumPy arrays have multiple attributes and methods. The cell below shows a few of them. You can press tab after typing the dot operator `'(.)'` to use IPython auto-complete and see the rest of them.

In [None]:
y = np.array([3, 0, -4, 6, 12, 2])

In [None]:
print("number of dimensions:\t", y.ndim)        
print("dimension of the array:", y.shape)       
print("numerical data type:\t", y.dtype)
print("maximum of the array:\t", y.max())       
print("index of the array max:", y.argmax())    
print("mean of the array:\t", y.mean())      

#### <span style="color:#0b486b">4.2.6 Multi-dimensional arrays</span>


You can define arrays with 2 (or higher) dimensions in numpy:

##### from lists

In [None]:
x = [[1, 2, 10, 20], [3, 4, 30, 40]]
y = np.array(x)
print(y)
print()
print(y.ndim, y.shape)

##### pre-filled 

In [None]:
x = np.ones((3, 5), dtype='int')

In [None]:
print(x)
print()
print(x.ndim, x.shape)

##### `'diag()'`
diagonal matrix

In [None]:
np.diag([1, 2, 3])

<a id = "maninp"></a>

### <span style="color:#0b486b">4.3 Manipulating arrays</span>


#### <span style="color:#0b486b">4.3.1 Indexing</span>


Similar to lists, you can index elements in an array using `'[]'` and indices:

If `'x'` is a 1-dimensional array, `'x[i]'` will index the `'ith'` element of `'x'` (counting from zero):

In [None]:
x = np.array([2, 8, -2, 4, 3])
print(x[3])

If `'x'` is a 2-dimensional array:

* `'x[i, j]'` or `'x[i][j]'` will index the element in the `'ith'` row and `'jth'` column
* `'x[i, :]'` will index the `'ith'` row
* `'x[:, j]'` will index the `'jth'` column

In [None]:
x = np.array([[7, 6, 8, 6, 4],
              [4, 7, -2, 0, 9]])
              
print(x[1, 3])

In [None]:
print(x[1, :])      # or x[1]

In [None]:
print(x[:, 3])

Arrays can also be indexed with other arrays:

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

idx1 = [1, 3, 4]        # list
idx2 = np.array(idx1)   # array

print(x[idx1], x[idx2])
x[idx2] = 0
print(x)

You can also index with masks. The index mask should be a NumPy array of data type `bool`. An element of the array is selected only if the index mask is True at that element's position.

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

In [None]:
mask = np.array([False, True, True, False, False, True, False])

In [None]:
x[mask]

Combining index masks with comparison operators enables you to conditionally select elements of the array.

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])
mask = (x >= 2) & (x < 9)   # element-wise AND of two boolean masks
x[mask]
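
Boolean masks can be combined with the element-wise operators `&` (and), `|` (or) and `~` (not). A minimal sketch on the same array; note that each comparison needs its own parentheses:

```python
import numpy as np

x = np.array([2, 8, -2, 4, 3, 9, 0])

outside = x[(x < 0) | (x > 8)]       # OR: negative or greater than 8
small = x[~(x >= 2)]                 # NOT: everything below 2
count = np.sum((x >= 2) & (x < 9))   # True counts as 1, so this counts matches

print(outside, small, count)
```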

#### <span style="color:#0b486b">4.3.2 Slicing</span>


Similar to Python lists, arrays can also be sliced:

In [None]:
x = np.array([2, 8, -2, 4, 3, 9, 0])

print(x[3:])    # slicing
print(x[3:7:2])  # slicing with a specified step

In [None]:
x = np.array([[7, 6, 8, 6, 4, 3],
              [4, 7, 0, 5, 9, 5],
              [7, 3, 6, 3, 5, 1]])
              

print(x[1, 1:4])
print()
print(x[:2, 1::2])    # rows zero up to 2, cols 1 up to end with a step=2

#### <span style="color:#0b486b">4.3.3 Iteration over items</span>


Since most NumPy functions operate on whole arrays, iterating over the items of an array can (and should) be avoided in many cases. When it is needed, it works much like iterating over the values of a list:

In [None]:
a = np.arange(0, 50, 7)
print(a)
for item in a:
    print(item)

Of course you could iterate over items using their indices too:

In [None]:
a = np.arange(0, 50, 7)
for i in range(a.shape[0]):
    print(a[i])
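
To illustrate why explicit iteration can often be avoided, the sketch below computes a sum of squares twice: once with a loop and once with a single array expression. The variable names are illustrative:

```python
import numpy as np

a = np.arange(0, 50, 7)

# Explicit loop over the items
total_loop = 0
for item in a:
    total_loop += item ** 2

# Same computation as one array expression
total_vec = (a ** 2).sum()

print(total_loop, total_vec)
```

Both give the same result; the array expression is shorter and, for large arrays, usually much faster.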

There are also many functions for manipulating arrays. The most used ones are:

#### <span style="color:#0b486b">4.3.4 `copy()`</span>


**Remember** that the assignment operator does not copy an array. Python does not copy the values on assignment; it binds a second name to the same object. Because lists and arrays are mutable, a change made through one name is visible through the other.

In [None]:
x = [1, 2, 3]
y = x
print(x, y)

In [None]:
y[0] = 0       # now we alter an element of y
print(x, y)     # note that x has changed as well

The same is true for NumPy arrays. That is why, if you need a copy of an array, you should use the `'copy()'` function.

In [None]:
x = np.array([1, 2, 3])
y = x

y[0] = 0       # now we alter an element of y
print(x, y)     # note that x has changed as well

In [None]:
x = np.array([1, 2, 3])
y = x.copy()  # or np.copy(x)
y[0] = 0

print(x, y)
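
The same caution applies to slices: a basic slice of a NumPy array is a view onto the original data, not a copy. A short sketch, using `np.shares_memory` to check whether two arrays overlap in memory:

```python
import numpy as np

x = np.arange(5)

view = x[1:3]            # a slice is a view onto x's data
view[0] = 99             # ...so this also changes x
print(x)

safe = x[1:3].copy()     # an independent copy
safe[0] = -1             # x is left untouched
print(x, np.shares_memory(x, view), np.shares_memory(x, safe))
```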

#### <span style="color:#0b486b">4.3.5 `reshape()`</span>


In [None]:
x1 = np.arange(6)
x2 = x1.reshape((2, 3))    # or np.reshape(x1, (2, 3))

print(x1)
print()
print(x2)
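
One dimension passed to `reshape()` may be given as `-1`, in which case NumPy infers it from the total number of elements. A minimal sketch:

```python
import numpy as np

x1 = np.arange(6)

x2 = x1.reshape((3, -1))   # -1 lets NumPy work out the missing dimension
print(x2.shape)

flat = x2.reshape(-1)      # back to one dimension
print(flat)

# The total number of elements must match, otherwise a ValueError is raised
try:
    x1.reshape((4, 2))
except ValueError as err:
    print("cannot reshape:", err)
```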

#### <span style="color:#0b486b">4.3.6 `astype()`</span>


Used for type casting:

In [None]:
x1 = np.arange(5)
x2 = x1.astype(float)

print(type(x1), x1)
print(type(x2), x2)

#### <span style="color:#0b486b">4.3.7 `T`</span> 

`T` is the transpose attribute (no parentheses needed):

In [None]:
x1 = np.random.randint(5, size=(2, 4))
x2 = x1.T

print(x1)
print()
print(x2)

<a id = "arrayop"></a>

### <span style="color:#0b486b">4.4 Array operations</span>


#### <span style="color:#0b486b">4.4.1 Arithmetic operators</span>


Arrays can be added, subtracted, multiplied and divided using +, -, \* and /. Operations done by these operators are **element-wise**.

In [None]:
x1 = np.array([[2, 3, 5, 7], 
               [2, 4, 6, 8]], dtype=float)
x2 = np.array([[6, 5, 4, 3], 
               [9, 7, 5, 3]], dtype=float)

In [None]:
print(x1)
print()
print(x2)

In [None]:
print(x1 + x2)

In [None]:
print(x1 - x2)

In [None]:
print(x1 * x2)

In [None]:
print(x1 / x2)

In [None]:
print(3 + x1)

In [None]:
print(3 * x1)

In [None]:
print(3 / x1)
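
Because `*` is element-wise, it is not the matrix product. For matrix multiplication NumPy provides the `@` operator (equivalently `np.matmul`). A short sketch contrasting the two on the arrays used above:

```python
import numpy as np

x1 = np.array([[2, 3, 5, 7],
               [2, 4, 6, 8]], dtype=float)
x2 = np.array([[6, 5, 4, 3],
               [9, 7, 5, 3]], dtype=float)

elementwise = x1 * x2      # shape (2, 4): products of matching entries
matrix = x1 @ x2.T         # shape (2, 2): rows of x1 times rows of x2

print(elementwise.shape, matrix.shape)
print(matrix)
```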

#### <span style="color:#0b486b">4.4.2 Boolean operators</span>

Much like the arithmetic operators discussed above, boolean (comparison) operators operate element-wise on arrays.

In [None]:
x1 = np.array([2, 3, 5, 7])
x2 = np.array([2, 4, 6, 7])
y = x1<x2

print( y, y.dtype)

Use the methods `'.any()'` and `'.all()'` to return a single boolean value indicating whether any or all values in the array are True, respectively. This value in turn can be used as a condition for an `'if'` statement.

In [None]:
print (y.all())
print (y.any())

NumPy has many other functions that you can read about in the [NumPy User Guide](http://docs.scipy.org/doc/numpy/user/). In particular, read about:

* `np.unique`, returns unique elements of an array
* `ndarray.flatten` (an array method; `np.ravel` is the function form), flattens a multi-dimensional array
* `np.mean`, `np.std`, `np.median`
* `np.min`, `np.max`, `np.argmin`, `np.argmax`
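
A quick sketch of these functions on a small, arbitrarily chosen 2-D array:

```python
import numpy as np

a = np.array([[3, 1, 2],
              [3, 5, 1]])

print(np.unique(a))                  # sorted unique values
print(a.flatten())                   # 1-D copy, row by row
print(np.mean(a), np.median(a), np.std(a))
print(np.min(a), np.max(a))
print(np.argmin(a), np.argmax(a))    # positions in the flattened array
```

Most of these also accept an `axis` argument, for example `np.max(a, axis=0)` for the column-wise maxima.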

<a id = "random"></a>

### <span style="color:#0b486b">4.5 np.random</span>


NumPy has a module called `random` to generate arrays of random numbers. There are different ways to generate a random number:

In [None]:
print(np.random.rand())

In [None]:
# 2x5 random array drawn from a uniform distribution on [0, 1)
print(np.random.random([2, 5]))

In [None]:
# 2x5 random array drawn from a uniform distribution on [0, 1)
print(np.random.rand(2, 5))

What is the difference between np.random.rand() and np.random.random()?

Both functions generate samples from the uniform distribution on \[0, 1). The only difference is in how the arguments are handled. With numpy.random.rand, the length of each dimension of the output array is a separate argument. With numpy.random.random or numpy.random.random_sample (numpy.random.random is actually an alias for numpy.random.random_sample), the shape argument is a single tuple.

(source: [stackoverflow](https://stackoverflow.com/questions/47231852/np-random-rand-vs-np-random-random))

In [None]:
# 2x5 random array drawn from a uniform distribution on {0, 1, 2, ..., 9}
print(np.random.randint(10, size=[2, 5]))

#### <span style="color:#0b486b">4.5.1 Random seed</span>


Random numbers generated by computers are not really random. They are called pseudo-random. Thus we can set the random generator to generate the same set of random numbers every time. This is useful while testing the code.

In [None]:
for i in range(5):
    print(np.random.random())

In [None]:
for i in range(5):
    np.random.seed(100)
    print(np.random.random())
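
In practice the seed is usually set once, before drawing a whole sequence of numbers. The sketch below shows that re-seeding with the same value reproduces the sequence; the seed value 100 is arbitrary. It also shows `np.random.default_rng`, the generator-based interface available in NumPy 1.17 and later, as an alternative:

```python
import numpy as np

np.random.seed(100)                 # seed once...
first = np.random.random(3)         # ...then draw several numbers

np.random.seed(100)                 # re-seeding restarts the same sequence
second = np.random.random(3)

print(first)
print(np.array_equal(first, second))

# Newer NumPy code often uses a Generator object that carries its own seed
rng = np.random.default_rng(100)
print(rng.random(3))
```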

---
## <span style="color:#0b486b">5. File I/O</span>

For online platforms such as Google Colab or IBM Cloud, it is important to get familiar with the platform's own data storage or cloud storage functions. Alternatively, you can save the file directly to the platform's file system and load it into your Notebook.

In [None]:
!pip install wget

Then you can download the file into the online platform's file system.

In [None]:
import wget

link_to_data = 'https://raw.githubusercontent.com/gaoshangdeakin/SIT384/master/csv_data1.csv'
DataSet = wget.download(link_to_data)

<a id = "txt"></a>

### <span style="color:#0b486b">5.1 TXT</span>


The TXT file format is the simplest way to store data.

Load a TXT file with `'np.loadtxt()'`:

In [None]:
import numpy as np

# This code is for local PC
# x = np.loadtxt("data/txt_data1.txt")

# The following code for online platform such as colab or IBM Cloud
link_to_data = 'https://raw.githubusercontent.com/gaoshangdeakin/SIT384/master/txt_data1.txt'
DataSet = wget.download(link_to_data)

x = np.loadtxt("txt_data1.txt")
x

Save a TXT file with `'np.savetxt()'`:

In [None]:
y = np.random.randint(10, size=5)
np.savetxt("txt_data2.txt", y)
y

<a id = "csv"></a>

### <span style="color:#0b486b">5.2 CSV</span>


The Comma Separated Values format and its variations are among the most used file formats for storing data.

You can use `'np.genfromtxt()'` to read a CSV file:

**NOTE:** The best way to read CSV and XLS files is using **pandas** package that will be introduced later.

In [None]:
import wget

link_to_data = 'https://raw.githubusercontent.com/gaoshangdeakin/SIT384/master/csv_data1.csv'
DataSet = wget.download(link_to_data)

print(DataSet)

In [None]:
import numpy as np

x = np.genfromtxt("csv_data1.csv", delimiter=",")
x

Use `'np.savetxt()'` to save a 2d-array in a CSV file.

In [None]:
x = np.random.randint(10, size=(6,4))
np.savetxt("csv_data2.csv", x, delimiter=',')
x
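
CSV files often start with a row of column names. A minimal round-trip sketch under that assumption; the file name and the column names `a,b,c` are hypothetical, and the file is written to a temporary folder:

```python
import os
import tempfile
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Hypothetical file name inside a temporary folder
path = os.path.join(tempfile.mkdtemp(), "csv_with_header.csv")

# fmt='%d' writes integers; comments='' keeps the header free of a leading '#'
np.savetxt(path, table, delimiter=",", fmt="%d", header="a,b,c", comments="")

# skip_header=1 ignores the column names when reading the numbers back
loaded = np.genfromtxt(path, delimiter=",", skip_header=1)
print(loaded)
```

Note that `np.genfromtxt` returns floating-point values by default, even though integers were written.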

<a id = "json"></a>

### <span style="color:#0b486b">5.3 JSON</span>


JSON is the most used file format when dealing with web services. 

To read a JSON file, use the `'json'` package: `'load()'` reads from an open file object, and `'loads()'` parses JSON text that is already held in a string. Either way the data is parsed into Python objects, typically a dictionary.

In [None]:
link_to_data = 'https://raw.githubusercontent.com/gaoshangdeakin/SIT384/master/json_data1.json'
DataSet = wget.download(link_to_data)

print(DataSet)

In [None]:
import json
with open("json_data1.json", 'rb') as fp:
    fcontent = fp.read()
# data = json.loads(fcontent)
data = json.loads(fcontent.decode('utf-8'))
data.keys()

In [None]:
data

In [None]:
data['phoneNumbers']

You can also write a Python dictionary (or a list of dictionaries) into a JSON file. To do this, use `'dump()'` to write to a file object or `'dumps()'` to get a JSON string.

In [None]:
data = [{'Name': 'Zara', 'Age': 7, 'Class': 'First'},
        {'Name': 'Lily', 'Age': 9, 'Class': 'Third'}]
data

In [None]:
with open("json_data_now.json", 'w') as fp:
    json.dump(data, fp)
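
The four functions pair up: `dumps()`/`loads()` work on strings, while `dump()`/`load()` work on file objects. A minimal round-trip sketch; the temporary file name is hypothetical:

```python
import json
import os
import tempfile

data = [{'Name': 'Zara', 'Age': 7, 'Class': 'First'},
        {'Name': 'Lily', 'Age': 9, 'Class': 'Third'}]

# dumps()/loads() convert to and from a JSON string
text = json.dumps(data, indent=2)
from_text = json.loads(text)

# dump()/load() write to and read from a file object
path = os.path.join(tempfile.mkdtemp(), "roundtrip.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump(data, fp)
with open(path, "r", encoding="utf-8") as fp:
    from_file = json.load(fp)

print(from_text == data, from_file == data)
```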