# <span style="color:#0b486b">SIT 112 - Data Science Concepts</span>

---
Lecturer: Duc Thanh Nguyen | duc.nguyen@deakin.edu.au<br />

School of Information Technology, <br />
Deakin University, VIC 3215, Australia.

---

## <span style="color:#0b486b">Practical Session 2: Knowing your Data</span>

**The purpose of this session is to teach you:**

1. how to recognize different data formats, in particular: .csv, .xls, .xlsx, .json, .xml
2. how to perform basic input/output operations
3. how to load a specific package in Python using the `import` statement

---

## <span style="color:#0b486b">Review</span>



<!-- <img src="files/images/slide2.png" width="800"> -->
<img src="images/slide2.png" width="800">

<br/>

<!-- <img src="files/images/slide3.png" width="800"> -->
<img src="images/slide3.png" width="800">

---
## <span style="color:#0b486b">1. Data file formats</span>

In many cases the data you need to work with is stored in files. Real-world data usually comes in a file of some type, such as txt, csv, xml, or json.

### <span style="color:#0b486b">1.1 TXT</span>
A flat file is the simplest way to store data. The data appears as a list of entries that you can read one at a time. The least formatted, and therefore easiest to read, flat-file format is the text file. However, a text file treats all data as strings, so you often have to convert numeric data into other forms. The image below shows an example of some data saved in a txt file.

<!-- <img src="files/images/txt_file_screenshot.png" width="629"> -->
<img src="images/txt_file_screenshot.png" width="629">
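
Because every value in a text file arrives as a string, numeric fields need an explicit conversion after reading. Here is a minimal sketch, assuming a hypothetical file `temperatures.txt` with one number per line. The file is created in a temporary directory only so that the example is self-contained, and `print(...)` is called with a single argument so the code behaves the same in Python 2 and Python 3.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example runs on its own
path = os.path.join(tempfile.mkdtemp(), "temperatures.txt")
with open(path, "w") as f:
    f.write("21.5\n19.0\n23.25\n")

# Read the file one line at a time; every line comes back as a string
with open(path) as f:
    lines = [line.strip() for line in f]

# Convert the strings to floats before doing any arithmetic
values = [float(line) for line in lines]

print(lines)
print(sum(values) / len(values))
```

Trying to sum `lines` directly would fail, which is why the conversion step is needed.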

### <span style="color:#0b486b">1.2 CSV</span>
The comma-separated values (CSV) file type is similar to the txt file type, but because the values are separated by commas, it can store tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas. Sometimes other delimiters, such as spaces or tabs, are used instead of commas. Although such files contain tab-separated or space-separated values, they are often still saved with the CSV extension.

<!-- <img src="files/images/csv_file_screenshot.png" width="629"> -->
<img src="images/csv_file_screenshot.png" width="629">
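
The record and field structure described above can be explored with the standard `csv` module. Here is a small sketch using made-up records held in lists of strings (no file is needed, since `csv.reader` accepts any iterable of lines). The second call shows how a different delimiter is handled.

```python
import csv

# Hypothetical records: the same table with commas and with tabs
comma_lines = ["name,age", "Alice,30", "Bob,25"]
tab_lines = ["name\tage", "Alice\t30", "Bob\t25"]

# Each record becomes a list of fields; all fields are strings
comma_rows = list(csv.reader(comma_lines))
tab_rows = list(csv.reader(tab_lines, delimiter="\t"))

print(comma_rows)
print(comma_rows == tab_rows)
```

Note that the ages come back as strings, just as with a plain text file.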

### <span style="color:#0b486b">1.3 XLS and XLSX</span>
These are `Microsoft Excel` file formats to store tabular data. XLS is a binary file format and XLSX is XML based.

<!-- <img src="files/images/xls_file_screenshot.png" width="629"> -->
<img src="images/xls_file_screenshot.png" width="629">

### <span style="color:#0b486b">1.4 XML</span>

XML (`Extensible Markup Language`) is a file format used to create information formats and to share both the format and the data as standard text. XML describes content in terms of what data is being described. For example, the word "phonenum" placed within markup tags could indicate that the data that follows is a phone number. XML is considered extensible because, unlike HTML, its markup tags are unlimited and self-defined. Much of the data you collect from the web is in XML or JSON format.

<!-- <img src="files/images/xml_file_screenshot.png" width="629"> -->
<img src="images/xml_file_screenshot.png" width="629">
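
To see how self-defined tags describe the data, the sketch below parses a tiny, made-up XML record with `xml.etree.ElementTree` from the standard library. The tag names `contact`, `name` and `phonenum` are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with self-defined tags
xml_text = "<contact><name>John</name><phonenum>555-0100</phonenum></contact>"

# Parse the text into a tree and look up elements by tag name
root = ET.fromstring(xml_text)

print(root.tag)
print(root.find("name").text)
print(root.find("phonenum").text)
```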

### <span style="color:#0b486b">1.5 JSON</span>

`JavaScript Object Notation` (JSON) is a lightweight file format used for data interchange. It uses human-readable text to transmit data objects as attribute-value pairs. It is an alternative to XML for transmitting data between a server and a web application and is usually preferred to XML because it is lighter. It was derived from the JavaScript language, hence the name, but it is a language-independent data format, and code for parsing and generating JSON data is readily available in many languages.

<!-- <img src="files/images/json_file_screenshot.png" width="629"> -->
<img src="images/json_file_screenshot.png" width="629">
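
In Python, the standard `json` module does the parsing and generating. Here is a minimal sketch with a made-up record: `json.loads()` turns JSON text into Python dictionaries and lists, and `json.dumps()` goes the other way.

```python
import json

# Hypothetical JSON text made of attribute-value pairs
json_text = '{"name": "John", "age": 30, "languages": ["Python", "R"]}'

# Parse the text into a Python dictionary
record = json.loads(json_text)
print(record["name"])
print(record["languages"][0])

# Generate JSON text again from the Python object
back_to_text = json.dumps(record, sort_keys=True)
print(back_to_text)
```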

**Exercise:**

Go to http://data.gov.au and look for a dataset that interests you. Download it and browse the dataset. Can you make sense of the dataset by looking at the raw data? Think about what you can do to gain better insight into the dataset you have downloaded.

---
## <span style="color:#0b486b">2. Python packages</span>

After completing [practical session 1](01-prac1.ipynb), you should know the syntax and semantics of the Python language. Beyond that, you should also learn about Python's libraries and packages to be able to code efficiently. Python’s standard library is very extensive, offering a wide range of facilities as indicated [here](https://docs.python.org/2/library/). The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. Look at the [Python Standard Library Manual](https://docs.python.org/2/library/) to read more.

In addition to the standard library, there is a growing collection of several thousand components (from individual programs and modules to packages and entire application development frameworks), available from the [Python Package Index](https://pypi.python.org/pypi).

### <span style="color:#0b486b">2.1 Standard libraries</span>

For a complete list of the Python standard library modules and their documentation, look at the [Python Manual.](https://docs.python.org/2/library/) A few to mention are:

* ``math`` for numeric and math-related functions and data types
* ``urllib`` for fetching data across the web
* ``datetime`` for manipulating dates and times
* ``pickle`` and ``cPickle`` for serializing and deserializing data structures enabling us to save our variables on the disk and load them from the disk
* ``os`` for os dependent functions
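
As an illustration of one of these modules, the sketch below uses `pickle` to save a variable to disk and load it back. The dictionary `scores` and the file name `scores.pkl` are made up for the example, and the file is written to a temporary directory so nothing is left in your working folder.

```python
import os
import pickle
import tempfile

# Hypothetical data structure to save
scores = {"alice": [90, 85], "bob": [72, 88]}

path = os.path.join(tempfile.mkdtemp(), "scores.pkl")

# Serialize the variable to disk (binary mode is required)
with open(path, "wb") as f:
    pickle.dump(scores, f)

# Load it back into a new variable
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == scores)
```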

### <span style="color:#0b486b">2.2 Third party packages</span>

There are thousands of third party packages, each developed for a special task. Some of the useful libraries for data science are:

* ``numpy`` is probably the most fundamental package for efficient scientific computing in Python
* ``scipy`` is one of the core packages for scientific computations
* ``pandas`` is a library for working with table-like data structures called DataFrames
* ``matplotlib`` is a comprehensive plotting library
* ``BeautifulSoup`` is an HTML and XML parser
* ``scikit-learn`` is the most general machine learning library for Python
* ``nltk`` is a toolkit for natural language processing

### <span style="color:#0b486b">2.3 How to install a package</span>

The easiest way to install a package is using `conda` (if you are using Anaconda) or `pip` commands. Suppose you want to install the package `NLTK`. Either:
    
    > conda install nltk
    
or    
    
    > pip install nltk
    
    
will install the package.
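
One simple way to check from Python whether a package is available is to try importing it and catch the `ImportError`. This is only a sketch: the variable name `nltk_available` is made up, and the result depends on your own environment.

```python
# Try to import the package; an ImportError means it is not installed
try:
    import nltk
    nltk_available = True
except ImportError:
    nltk_available = False

print(nltk_available)
```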

---
## <span style="color:#0b486b">3. Importing a module</span>

To use a module, first you have to ``import`` it. There are different ways to import a module:

* `import my_module`
* `from my_module import my_function`
* `from my_module import my_function as func`
* `from my_module import submodule`
* `from my_module import submodule as sub`
* `from my_module import *`

**`'import my_module'`** imports the module `'my_module'` and creates a reference to it in the namespace. For example `'import math'` imports the module `'math'` into the namespace. After importing the module this way, you can use the dot operator `(.)` to refer to the objects defined in the module. For example `'math.exp()'` refers to function `'exp()'` in module `'math'`.

In [None]:
import math

x = 2
y1 = math.exp(x)
y2 = math.log(x)

print "e^{} is {} and log({}) is {}".format(x, y1, x, y2)

**`'from my_module import my_function'`** only imports the function `'my_function'` from the module `'my_module'` into the namespace. This way you have access to neither the module itself (since you have not imported it) nor the other objects of the module. You only have access to the object you have imported.

You can use a comma to import multiple objects.

In [None]:
from math import exp

x = 2
y = exp(x)  # no need to write math.exp()

print "e^{} is {}".format(x, y)
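
The comma form mentioned above looks like this. The sketch imports two functions from `math` in one statement. `print(...)` is called with a single argument so the code runs the same way in Python 2 and Python 3.

```python
# Import several objects from one module, separated by commas
from math import exp, log

x = 2
y1 = exp(x)
y2 = log(x)

print("e^{} is {} and log({}) is {}".format(x, y1, x, y2))
```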

**`'from my_module import my_function as func'`** imports the function `'my_function'` from module `'my_module'` but its identifier in the namespace is changed into `'func'`. This syntax is used to import submodules of a module as well. For example later you will see that nowadays it is almost a convention to import matplotlib.pyplot as plt.

In [None]:
# you can change the name of the imported object
from math import exp as myfun

x = 2
y = myfun(x)

print "e^{} is {}".format(x, y)
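
The same `as` renaming applies to submodules. Here is a short sketch using `os.path` from the standard library. The alias `p` and the file name are arbitrary choices for illustration.

```python
# Import a submodule under a shorter name
from os import path as p

joined = p.join("data", "file.csv")

print(p.basename(joined))
print(p.splitext(joined)[1])
```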

**`'from my_module import *'`** imports all the public objects defined in `'my_module'` into the namespace. Therefore after this statement you can simply use the plain name of the object to refer to it and there is no need to use the dot operator:

In [None]:
from math import *

x = 2
y1 = exp(x)
y2 = log(x)

print "e^{} is {} and log({}) is {}".format(x, y1, x, y2)

**Exercise1:** 

1. import the `math` module from the Python standard library
2. define a variable and assign an integer value to it (smaller than 20)
3. use the `factorial()` function (an object in the `math` module) to calculate the factorial of the variable
4. print its value

In [None]:
# your code here

**Exercise2:**

1. write a function that takes an integer variable and returns its factorial
2. use it to find the factorial of the variable defined in Exercise1
3. do your answers match?

In [None]:
# your code here

---
## <span style="color:#0b486b">4. Python simple input/output</span>

### <span style="color:#0b486b">4.1 Input</span>

`raw_input()` asks the user for a string of data (ended with a newline), and simply returns the string.

In [None]:
x = raw_input('What is your name? ')

print "x is {}".format(type(x))
print "Your name is {}".format(x)
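
Since the returned value is always a string, it has to be converted before you can do arithmetic with it. In the sketch below, the variable `text` stands in for what the user would type, so the example can run without waiting for input.

```python
# Stand-in for the string that raw_input() would return
text = "0.5"

# Convert the string to a float before using it as a number
value = float(text)

print(type(text).__name__)
print(type(value).__name__)
print(value * 2)
```

If the text cannot be interpreted as a number, `float()` raises a `ValueError`.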

**Exercise3:**

1. use `raw_input()` to take a float value between -1 and 1 from the user
2. use the function `acos()` from `math` to find the arc cosine of it
3. print the value of the variable and its arc cosine

In [None]:
# your code here

As we know, the domain of the [arc cosine function][acos] is [-1, 1]. So what if the value entered by the user is not in the domain (the value is smaller than -1 or greater than 1)? What happens then?

To avoid raising a ValueError exception, make sure the value is in range before passing it to the `acos()` function, and if it is not, display an appropriate message.

[acos]: http://mathworld.wolfram.com/InverseCosine.html

In [None]:
# your code here

### <span style="color:#0b486b">4.2 Output</span>

The basic way to do output is the print statement. To print multiple things on the same line separated by spaces, use commas between them.

In [None]:
name = "John"
msg = "Hello"

print msg
print msg, name

Output from several print statements can appear on the same line if each print statement ends with a comma:

In [None]:
for i in range(10):
    print i,
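
An alternative that does not rely on the trailing comma is to build the whole line first and print it once. This sketch uses `str.join()` for that. It is a different technique from the one above, shown only for comparison.

```python
# Build one space-separated string, then print it in a single call
line = " ".join(str(i) for i in range(10))

print(line)
```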

---
## <span style="color:#0b486b">5. datetime module</span>


The datetime module includes functions and classes for date and time parsing, formatting, and arithmetic.

### <span style="color:#0b486b">5.1 Time</span>

Time values are represented with the time class. Times have attributes for hour, minute, second, and microsecond. They can also include time zone information.

In [None]:
import datetime

t = datetime.time(11, 21, 33)
print t
print 'hour  :', t.hour
print 'minute:', t.minute
print 'second:', t.second
print 'microsecond:', t.microsecond
print 'tzinfo:', t.tzinfo
print 'tzname:', t.tzname()
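
The time created above carries no time zone information. To attach some, you pass a `tzinfo` object when creating the time. Here is a minimal sketch, assuming a hypothetical `FixedOffset` class (not part of the standard library) that represents a constant offset from UTC. The offset of 10 hours and the name "AEST" are example values.

```python
import datetime


class FixedOffset(datetime.tzinfo):
    """Hypothetical time zone with a constant offset from UTC."""

    def __init__(self, hours, name):
        self._offset = datetime.timedelta(hours=hours)
        self._name = name

    def utcoffset(self, dt):
        return self._offset

    def tzname(self, dt):
        return self._name

    def dst(self, dt):
        # No daylight saving adjustment in this simple sketch
        return datetime.timedelta(0)


t = datetime.time(11, 21, 33, tzinfo=FixedOffset(10, "AEST"))

print(t)
print(t.tzname())
print(t.utcoffset())
```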

### <span style="color:#0b486b">5.2 Date</span>

Calendar date values are represented with the date class. Instances have attributes for year, month, and day.

In [None]:
import datetime

today = datetime.date.today()
print today
print 'ctime:', today.ctime()
print 'tuple:', today.timetuple()
print 'ordinal:', today.toordinal()
print 'Year:', today.year
print 'Mon :', today.month
print 'Day :', today.day

A way to create new date instances is using the `replace()` method of an existing date. For example, you can change the year, leaving the day and month alone.

In [None]:
import datetime

d1 = datetime.date(2013, 3, 12)
print 'd1:', d1

d2 = d1.replace(year=2015)
print 'd2:', d2

**Exercise4:**

1. Write a piece of code that gives you the day of the week on which you were born.
2. How about this year? Do you know what day of the week it is?

In [None]:
# your code here

### <span style="color:#0b486b">5.3 timedelta</span>

Using `replace()` is not the only way to calculate future/past dates. You can use datetime to perform basic arithmetic on date values via the timedelta class. 

In [37]:
today = datetime.datetime.today()
print today

tomorrow = today + datetime.timedelta(days=1)
print tomorrow

2015-07-30 11:40:15.532000
2015-07-31 11:40:15.532000
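
Arithmetic works in the other direction too: subtracting one date from another gives a `timedelta`. The sketch below uses two fixed example dates so that the results are reproducible.

```python
import datetime

start = datetime.date(2015, 7, 30)
end = datetime.date(2015, 12, 25)

# Subtracting two dates returns a timedelta
gap = end - start
print(gap.days)

# A timedelta can also be subtracted from a date
one_week_earlier = start - datetime.timedelta(weeks=1)
print(one_week_earlier)
```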


**Exercise5:**

Rewrite exercise4 with timedelta.

In [None]:
# your code here

You can use comparison operators for datetime objects too. It makes sense, right?

In [38]:
tomorrow > today

True
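
Because dates can be compared, they can also be sorted and passed to `min()` and `max()`. Here is a short sketch with a made-up list of dates.

```python
import datetime

# Hypothetical list of dates in no particular order
dates = [
    datetime.date(2015, 7, 30),
    datetime.date(2013, 3, 12),
    datetime.date(2014, 1, 1),
]

# Sorting relies on the same comparison operators
ordered = sorted(dates)

print(min(dates))
print(ordered[-1])
```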

### <span style="color:#0b486b">5.4 Formatting and Parsing</span>

The default string representation of a datetime object follows ISO 8601, with a space rather than `T` between the date and the time (YYYY-MM-DD HH:MM:SS.mmmmmm). Alternate formats can be generated using `strftime()`. Similarly, if your input data includes timestamp values parsable with `time.strptime()`, then `datetime.strptime()` is a convenient way to convert them to datetime instances.

In [49]:
today = datetime.datetime.today()
print 'ISO     :', today

ISO     : 2015-07-30 11:47:59.197000


String from a datetime object:

In [54]:
str_format = "%a %b %d %H:%M:%S %Y"
s = today.strftime(str_format)
print 'strftime:', s

strftime: Thu Jul 30 11:47:59 2015


Datetime object from a string:

In [55]:
print s

d = datetime.datetime.strptime(s, str_format)
print d
print 'strptime:', d.strftime(str_format)

Thu Jul 30 11:47:59 2015
2015-07-30 11:47:59
strptime: Thu Jul 30 11:47:59 2015


In [60]:
s = "07/30/2015"
str_format = "%m/%d/%Y"

d = datetime.datetime.strptime(s, str_format)
print d

2015-07-30 00:00:00
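
Here is a short sketch comparing the default string form with `isoformat()` and a custom `strftime()` pattern. It uses a fixed datetime so that the results are reproducible.

```python
import datetime

d = datetime.datetime(2015, 7, 30, 11, 47, 59)

# Default string form: date and time separated by a space
default_text = str(d)
print(default_text)

# isoformat() separates date and time with the letter T
iso_text = d.isoformat()
print(iso_text)

# A custom pattern that keeps only the date part
date_text = d.strftime("%Y-%m-%d")
print(date_text)
```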


**Exercise6:**

You have the string "7/30/2015 - 12:13". How do you convert it into a datetime object?

In [None]:
# your code here