# COMP257 Week 2

Topics:
- Git Review
- Python Pandas, Series and DataFrames
- Getting Data
- Reading data in Python

## GIT

- Distributed Version Control
- Why are we introducing it here?
    - Any BIT student should be familiar with DVC
    - You are writing code - so you should be using VC
    - Provides an audit trail of your work on a project
    - You will be doing a group project, key to collaboration
    
There are lots of [guides to Git](http://rogerdudler.github.io/git-guide/) that will show you the basic commands and [explain how Git works](https://www.atlassian.com/git/tutorials/what-is-git) and [let you try commands](https://try.github.io/). 

You can learn Git on the command line or using a GUI.  Knowing the command line basics is useful if you are ever using it remotely (on a server for example).  Usually, using a GUI is the best idea for a beginner. One reason is that Git is quite complicated and it is easy to get yourself into a bit of a mess.

(Personal Opinion: Another DVCS - [Mercurial](https://www.mercurial-scm.org) - is much better than Git, as it is more restricted in what you can do and has a more logical 'model' of changes. However, Git now dominates the DVCS space, so it makes sense to use it to be compatible with the developer community.)


## Fermi Estimation

* The task we did last week (how many loaves of bread) is an example of an Estimation Problem
* Fermi Estimation is a technique for making estimates of the _order of magnitude_ of a result
* Not precise but tries to estimate to the nearest power of 10
* A good technique for working out whether a claimed result is reasonable
* Example: [Case Study: Foodstamp Fraud](https://callingbullshit.org/case_studies/case_study_foodstamp_fraud.html)
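
As a rough illustration of the technique, the sketch below multiplies a few rounded guesses and reports the nearest power of 10. The scenario (loaves of bread eaten per day in one city) and every input figure are assumptions made up for this example; they are not the numbers from last week's task.

```python
import math

# All inputs are illustrative guesses, rounded to one significant figure.
population = 5_000_000          # assumed number of people in the city
slices_per_person_per_day = 2   # assumed average bread consumption
slices_per_loaf = 20            # assumed size of a loaf

loaves_per_day = population * slices_per_person_per_day / slices_per_loaf

# A Fermi estimate only claims the order of magnitude,
# so round the base-10 logarithm to the nearest integer.
order_of_magnitude = round(math.log10(loaves_per_day))

print(f"Estimate: about 10^{order_of_magnitude} loaves per day")
```

If a claimed figure differs from an estimate like this by several powers of 10, either the claim or one of the guesses deserves a closer look.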

## Finding Data

A look at some places that could be good sources of data for DS projects.  What kind of data formats do they use? 

- [Data.Gov.au](https://data.gov.au/) - official publication channel for Australian Govt. data
- [Australian Bureau of Statistics](https://abs.gov.au/) Census and other survey data
- [kaggle.com](https://www.kaggle.com/datasets) runs Data Science & Machine Learning competitions
- [Open Addresses](https://openaddresses.io)
- [Search for it](https://www.google.com.au/search?client=safari&rls=en&q=open+data&ie=UTF-8&oe=UTF-8&gfe_rd=cr&ei=WXKKWYbgH-3c8weDmYG4BA) 


## Data Formats

What formats will you find? 
- Excel/CSV - easy to read as long as the data is a simple table (but what if it isn't?)
- XML (e.g. KML for geographical data)
- JSON
- PDF, Word, etc - often interesting data is locked in inappropriate formats
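
To make the difference between the first formats concrete, here is a small sketch that reads the same two records from CSV and from JSON using only the Python standard library. The records and field names are invented for illustration; a real file would be opened with `open()` rather than wrapped in `io.StringIO`.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in two formats.
csv_text = "name,city,amount\nAlice,Sydney,10.5\nBob,Perth,7\n"
json_text = (
    '[{"name": "Alice", "city": "Sydney", "amount": 10.5},'
    ' {"name": "Bob", "city": "Perth", "amount": 7}]'
)

# CSV: every value comes back as a string, so numbers must be converted.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_total = sum(float(row["amount"]) for row in csv_rows)

# JSON: numbers, strings and nesting are preserved by the parser.
json_rows = json.loads(json_text)
json_total = sum(row["amount"] for row in json_rows)

print(csv_rows[0])
print(json_rows[0])
print(csv_total, json_total)
```

The CSV reader returns `"10.5"` as text while the JSON parser returns the number `10.5`, which is one reason type conversion is often the first step after loading a CSV file.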

For example, see the [Data sets available from Transport for NSW](https://opendata.transport.nsw.gov.au/search/type/dataset?sort_by=changed), which can be filtered by data format.


Issues with data once you find it:
- Missing values for some fields in some records
- Values in fields are not consistent - e.g. responses to "What language do you speak at home?" or "What town were you born in?"
- Incomplete records - need to link to other data sources
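
The first two issues can be sketched in a few lines of Python. The survey records, the `town` field, and the normalisation rule (trim whitespace, ignore case) are all assumptions chosen for illustration; real data usually needs more careful, domain-specific cleaning.

```python
from collections import Counter

# Hypothetical survey responses; None and "" stand for missing values.
records = [
    {"name": "A", "town": "Sydney"},
    {"name": "B", "town": "sydney "},
    {"name": "C", "town": None},
    {"name": "D", "town": "SYDNEY"},
    {"name": "E", "town": ""},
    {"name": "F", "town": "Newcastle"},
]

def normalise(value):
    """Return a cleaned town name, or None if the value is missing."""
    if value is None or not value.strip():
        return None
    return value.strip().title()

towns = [normalise(r["town"]) for r in records]

# Count the missing values and tally the cleaned, non-missing ones.
missing = sum(1 for t in towns if t is None)
counts = Counter(t for t in towns if t is not None)

print("missing:", missing)
print(counts)
```

Without the normalisation step, the three spellings of Sydney would be counted as three different towns.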


## Reading XML Data

XML is a widely used file format for data on the web (or at least it has been in the past). It is a document markup language like HTML, so it can represent quite complex structures. However, it is often used to store simple tabular data as well.

Let's look at parsing a [sample XML file](files/sample.xml).


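```python
import xml.etree.ElementTree as ET

# Parse the file and get the top-level element
tree = ET.parse('files/sample.xml')
root = tree.getroot()

# Each child of the root is a row; print the tag of every field in it
for row in root:
    for child in row:
        print(child.tag)
```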

```
Name
Date
City
Amount
...
```

The same four tags repeat for every row; the rest of the output is omitted here.
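
Printing tag names shows the structure, but for analysis the values are usually wanted as records. The sketch below collects each row into a dictionary. Because the contents of `files/sample.xml` are not reproduced in these notes, it parses a small inline document instead; the `<Rows>` and `<Row>` element names and all the values are assumptions, and only the four child tags are taken from the sample file's output.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for files/sample.xml.
xml_text = """
<Rows>
  <Row><Name>Alice</Name><Date>2017-08-01</Date><City>Sydney</City><Amount>10.50</Amount></Row>
  <Row><Name>Bob</Name><Date>2017-08-02</Date><City>Perth</City><Amount>7.00</Amount></Row>
</Rows>
"""

root = ET.fromstring(xml_text)

# Build one dictionary per row, mapping each child tag to its text content.
records = [{child.tag: child.text for child in row} for row in root]

for record in records:
    print(record)
```

A list of dictionaries like this is a convenient intermediate form, since it maps directly onto the rows and columns of a table.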


## XML and Namespaces

* Often XML data contains namespaces
* Tag names are longer than they seem
* Need to use [Namespace support](https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces) to properly handle them

```xml
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
```
([example.kml](files/example.kml))

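```python
import xml.etree.ElementTree as ET

# Parse the KML file and get the top-level <kml> element
tree = ET.parse('files/example.kml')
root = tree.getroot()

# Tags are reported with the namespace URI in braces as a prefix
for row in root:
    for child in row:
        print(child.tag)
```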

```
{http://www.opengis.net/kml/2.2}Placemark
{http://www.opengis.net/kml/2.2}Placemark
{http://www.opengis.net/kml/2.2}Placemark
{http://www.opengis.net/kml/2.2}Placemark
{http://www.opengis.net/kml/2.2}Placemark
```
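
One way to handle these long tag names is to pass a prefix-to-URI mapping to `find()` and `findall()`, as described in the ElementTree namespace documentation linked earlier in this section. The sketch below uses a small inline KML document with made-up placemark names rather than `files/example.kml`; the `kml` prefix is an arbitrary label chosen here, not something defined by the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical KML fragment using the same default namespace as example.kml.
kml_text = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>Library</name></Placemark>
    <Placemark><name>Station</name></Placemark>
  </Document>
</kml>"""

# Map a prefix of our choosing to the namespace URI.
ns = {"kml": "http://www.opengis.net/kml/2.2"}

root = ET.fromstring(kml_text)

# "kml:Placemark" is expanded to "{http://www.opengis.net/kml/2.2}Placemark".
names = [
    placemark.find("kml:name", ns).text
    for placemark in root.findall("kml:Document/kml:Placemark", ns)
]
print(names)
```

Searching for plain `Placemark` without the mapping would find nothing, because the element's full name includes the namespace URI.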


## OpenRefine

[OpenRefine](http://openrefine.org/) is a tool for pre-processing data interactively.  It can read data in various formats and help you generate consistent tabular data that can feed into your analysis.  From their home page, OpenRefine can:

- Import data in various formats
- Explore datasets in a matter of seconds
- Apply basic and advanced cell transformations
- Deal with cells that contain multiple values
- Create instantaneous links between datasets
- Filter and partition your data easily with regular expressions
- Use named-entity extraction on full-text fields to automatically identify topics
- Perform advanced data operations with the General Refine Expression Language

[An example](https://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/) of using OpenRefine to create a usable dataset.

Use [OpenRefine via myBinder](https://t.co/dOFzv7xjhz)
