# Week 2- Data Structures and Dataframe Operations

**Objectives**: Today we are going to explore both Python-specific and general data structures that are important to data science. We will cover the following:
  
* Variables and datatypes
* Strings, lists, and tuples
* NumPy arrays
* Dictionaries and JSON
* Arrays, dataframes and series
* Basic dataframe operations

## Variables and Basic Datatypes

Variables in Python are names for values that are stored in memory. By convention, variable names are lowercase words that describe the value they hold, like:

<code>counter = 2</code>

Unlike statically typed languages such as C or Java, Python uses dynamic typing, so the datatype is inferred from the value assigned to the variable. In the <code>counter</code> example above, Python treats <code>2</code> as an integer, or type <code>int</code>.

Knowing that <code>counter</code> is an integer is critical if you plan to do any calculations with it. For example, if you wanted to divide the value in <code>counter</code> by 4 expecting the answer to be 0.5, you would be mistaken: Python 2 performs integer division when both operands are integers and drops the fractional part. Try this in the code block below:

In [None]:
# Assign 2 to a variable named "counter"

# Divide "counter" by four
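
As a minimal sketch of the integer-division behaviour described above: the example is written to run on both Python 2.7 and Python 3, so it uses the floor-division operator <code>//</code>, which gives the same result that plain <code>/</code> gives between two <code>int</code> values in Python 2. The divisor 4 and the variable names are chosen only for illustration.

```python
counter = 2

# Floor division discards the fractional part. In Python 2.7, plain `/`
# between two ints produces this same result.
floor_result = counter // 4
print(floor_result)

# Converting one operand to float first forces true division.
true_result = float(counter) / 4
print(true_result)
```

If you need a fractional answer in Python 2, make sure at least one operand is a <code>float</code>.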


**Basic Built-in Numeric Datatypes**

Python 2.7.x has the following built-in numeric datatypes:

* <code>int</code>: plain integers like 4
* <code>long</code>: long integers with unlimited precision like 4L
* <code>float</code>: floating point numbers like 4.0
* <code>complex</code>: complex numbers like 4j

To determine the datatype of any object, you can use the <code>type</code> function like:

<code>type(object)</code>
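
A short sketch of <code>type</code> in use, and of dynamic typing in general. The names <code>samples</code> and <code>counter</code> are illustrative only; the code should behave the same on Python 2.7 and Python 3.

```python
# Ask each sample value for the name of its type.
samples = [4, 4.0, 4j]
type_names = [type(item).__name__ for item in samples]
print(type_names)

# Dynamic typing: rebinding a name to a new value changes its type.
counter = 2
type_before = type(counter).__name__
counter = 2.0
type_after = type(counter).__name__
print(type_before)
print(type_after)
```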

Many analytic tasks use numeric types, so it is good to be familiar with them. Beyond numeric values, many analytic tasks will also include other built-in types like:

**Basic Built-in Boolean and Sequence Datatypes**

Boolean types:
* <code>bool</code>: boolean like True or False

Sequence types:
* <code>str</code>: string like this sequence of characters "this is a string"
* <code>list</code>: list like this list of ints: [1,2,3]
* <code>tuple</code>: tuple like this tuple of ints: (1,2,3)

The descriptions of these types are intentionally incomplete (why would you use a list versus a tuple, for example?), so you can learn more details here:

* https://docs.python.org/2/library/stdtypes.html

The inner workings of each of these types can become complex, but understanding the basics of accessing data is important to a data scientist, and there are some common approaches regardless of type.

For a sequence type, there are two ways to get parts of the sequence: indexing and slicing.

**Indexing**

Python uses zero-based indexing, which means the first element has index 0, the second element has index 1, and so on. To access a specific value, you append its index in square brackets to the sequence like:

<code>my_sequence[0] # The zero in square brackets returns the first or 0th element </code>
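
The same indexing syntax works for every sequence type. A minimal sketch, using made-up values and variable names:

```python
my_string = "hello"
my_list = [10, 20, 30, 40, 50]
my_tuple = (True, False, True)

first_char = my_string[0]    # index 0 is the first character
third_item = my_list[2]      # index 2 is the third element
second_flag = my_tuple[1]    # index 1 is the second element

print(first_char)
print(third_item)
print(second_flag)
```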

**Slicing**

Slices extend the idea of an index with an operator that selects a range of index locations in a sequence type. Specifically, you use <code>:</code> and index locations to retrieve parts of the sequence like:

````
my_sequence[start:end] # items from start through end-1
my_sequence[start:]    # items from start through the rest of the sequence
my_sequence[:end]      # items from the beginning through end-1
my_sequence[:]         # the whole sequence
````
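
To make the four slice forms concrete, here is a small sketch on a list and a string. The values are invented for illustration.

```python
my_sequence = [10, 20, 30, 40, 50]

middle = my_sequence[1:4]      # index 1 through index 3
tail = my_sequence[2:]         # index 2 through the end
head = my_sequence[:2]         # the beginning through index 1
whole = my_sequence[:]         # a new list with every element

# Slicing a string returns a string.
word_part = "hello"[1:3]

print(middle)
print(tail)
print(head)
print(whole)
print(word_part)
```

Note that <code>my_sequence[:]</code> returns a new list with the same elements, not the original list object.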

Enter some different sequence types below and index and slice them to return their elements.

In [None]:
# Assign a string type sequence with five elements to a variable name

# Print the variable

# Return the 4th element

# Return from the 2nd element to the end of the sequence

# Assign a list type sequence with five elements to a variable name

# Print the list

# Return the 2nd element of the list

# Return from the 1st element to the end of the sequence

# Assign a tuple type sequence with five elements that are bools to a variable name

# Print the tuple

# Return the 2nd element

# Return from the 1st element to the end of the sequence

## NumPy Arrays

The basic variables and sequence types are quite useful for a variety of tasks, but arrays and dataframes are the workhorses of analytics and data science. Arrays and dataframes are great not only because they can represent multidimensional data like a spreadsheet, but also because they support vectorized code. In Python, this means that operations on NumPy arrays and on the more general and flexible pandas dataframes run in optimized C code instead of Python <code>for</code> loops, so the code can be much more concise and much faster.

With large datasets, the speedup of vectorized operations can be significant when compared to <code>for</code> loops. When working with dataframes or arrays, try to think in matrix terms rather than iterating through each element.
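
The difference in style can be sketched as follows. This example only shows that the two approaches give the same answer; it does not measure speed, and the array <code>values</code> is made up for illustration.

```python
import numpy as np

values = np.arange(5)    # the integers 0 through 4

# Loop version: visit each element in Python and build up a list.
doubled_loop = []
for v in values:
    doubled_loop.append(int(v) * 2)

# Vectorized version: one expression applied to the whole array at once.
doubled_vec = values * 2

print(doubled_loop)
print(doubled_vec.tolist())
```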

**NumPy**

NumPy arrays can perform basic mathematical operations like add, subtract, multiply, and divide among arrays with the same shape on an element-by-element basis as follows. Observe that this is truly element by element and not matrix multiplication.

In [None]:
import numpy as np

my_arr = np.array([[10,20,30],[100,200,300],[1,2,3]])
print 'this is my_arr:'
print my_arr

my_other_arr = np.array([[1,0,1],[0,1,0],[1,0,1]])
print 'this is my_other_arr:'
print my_other_arr

print 'this is the product of my two arrays:' 
print my_arr * my_other_arr

When the arrays are different shapes, NumPy can broadcast compatibly shaped arrays together. In the next example, NumPy takes the 3x3 <code>my_arr</code> array and multiplies it by the single-row (1x3) array <code>my_one_row_arr</code>.

In [None]:
my_one_row_arr = np.array([[100,0,10]])

print my_arr * my_one_row_arr

Broadcasting conceptually replicates the smaller array to match the shape of the larger one. In this case, the 1x3 row is stretched across all three rows of the larger array, so each row of <code>my_arr</code> is multiplied element by element by <code>[100, 0, 10]</code>.

Try adding an additional element to the 1x3 array to make it a 1x4 and rerun the code multiplying the two arrays. As you will see, broadcasting requires arrays that have compatible sizes.
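
As a rough sketch of compatible and incompatible shapes, using small hypothetical arrays (<code>grid</code>, <code>row</code>, and <code>bad_row</code> are not part of the exercise):

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])           # shape (2, 3)
row = np.array([[10, 100, 1000]])      # shape (1, 3)

# The single row is stretched across both rows of `grid`.
product = grid * row
print(product)

# A (1, 4) array cannot be stretched to match three columns,
# so NumPy raises a ValueError instead of guessing.
bad_row = np.array([[1, 2, 3, 4]])
try:
    grid * bad_row
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```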

**Indexing and Slicing Arrays**

Many of the concepts we covered earlier work in a similar way for NumPy arrays, except across multiple dimensions. To access a given element in a two-dimensional array, the same basic Python syntax applies as in <code>x[obj]</code>. Instead of a single index as in a sequence, you can either return a whole row with a single integer, as in <code>[row]</code>, or an element within a row by selecting <code>[row, column]</code>.

Try accessing rows and individual elements in my array below.

In [None]:
my_arr[]

Slicing also works in a similar way, using the colon (<code>:</code>) in each dimension. Try the following slices below.

In [None]:
# Slice to return the last two rows

my_arr[]

# Slice to return the 2nd column

my_arr[]
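
For comparison, here is a sketch of the indexing and slicing forms on a separate, made-up array called <code>grid</code>. It deliberately selects different rows and columns from the ones the exercise asks for.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

first_row = grid[0]        # a single integer returns a whole row
element = grid[1, 2]       # row 1, column 2
top_rows = grid[:2]        # slice of the first two rows
first_col = grid[:, 0]     # every row, column 0

print(first_row)
print(element)
print(top_rows)
print(first_col)
```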

**Boolean Masks**

Boolean masks are a critical component of many analyses because they are a simple, vectorized way to filter an array and construct new arrays for a topic of interest. Masks are conceptually simple. Run the code below to see what happens when we apply a boolean test to our array. The syntax is the same as the earlier basic arithmetic operations that are broadcast across the array.

In [None]:
my_arr > 20

The resulting array has the same shape but holds boolean values from the element-wise logical test. Older NumPy versions show this as <code>dtype=bool</code> in the output; in any version you can confirm it with <code>(my_arr > 20).dtype</code>.

Where this gets really cool is when we combine this result with our selection syntax.  Before you run the next cell, what do you think the results will be?

In [None]:
my_arr[my_arr > 20]

The result should be a one-dimensional array containing only the elements for which the element-wise logical test is <code>True</code>.
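
The two steps, building the mask and then selecting with it, can be sketched separately. The array <code>scores</code> is hypothetical.

```python
import numpy as np

scores = np.array([[5, 50],
                   [15, 25]])

# Step 1: the comparison yields a boolean array of the same shape.
mask = scores > 20
print(mask.shape)

# Step 2: indexing with the mask keeps only the elements where it is True,
# flattened into one dimension in row-by-row order.
selected = scores[mask]
print(selected)
```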

If you want to learn more about indexing in NumPy, try this resource:

* http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

## Dictionaries and JSON

**Dictionaries**

Dictionaries are another important data type in Python. Dictionaries contain elements like sequences, but instead of referencing items by position, they use names. In the simplest terms, a dictionary is a collection of key-value pairs like:

````
{"user_number":123,
"user_name":"Issac"}
````

Where lists use <code>[]</code> and tuples use <code>()</code>, dictionaries or dicts use <code>{}</code>. To build a dictionary, within the curly brackets, each key precedes its value, separated by a colon, like:

````
{'my key':126}
````

Dictionaries are unordered, so the only way to retrieve a specific value is via its key. Dictionaries also have to have unique keys, but values need not be unique. Run the code below to demonstrate.

In [None]:
my_dict = {'user_number':123,
           'user_name':'Issac'}

print my_dict

print '\nThe user number is {}'.format(my_dict['user_number'])

While we have only been dealing with sequences and dictionaries with one level, all of these structures can support nested elements of multiple types. Try the following to explore this ability. Try some different datatypes in the value and key positions.

In [None]:
# Build a dictionary with two keys (odds and evens) and two values 
# that are lists of the odd and even numbers on a single six-sided die 
# and assign it to name "dice".

# Print the dictionary

# Print the second element of 'evens'

# Print the first element of 'odds'
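
Nested structures are accessed by chaining the selection syntax one level at a time. A minimal sketch with an invented <code>inventory</code> dictionary:

```python
inventory = {
    'fruits': ['apple', 'banana', 'cherry'],    # a list as a value
    'counts': {'apple': 3, 'banana': 12},       # a dictionary as a value
}

# First select by key, then index into the list that comes back.
second_fruit = inventory['fruits'][1]

# First select by key, then select by key again in the inner dictionary.
banana_count = inventory['counts']['banana']

print(second_fruit)
print(banana_count)
```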


**JSON and Structured, Semi-Structured, and Unstructured Data**

According to http://json.org:

>JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
>
>JSON is built on two structures:
>
>* A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
>
>* An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.


For those interested in data science and analytics, JSON is a very common data format. Many new NoSQL databases use JSON or JSON-like models as their foundational data structure or offer features to handle JSON documents, like:

* MongoDB: https://docs.mongodb.org/manual/core/data-modeling-introduction/
* DynamoDB: https://aws.amazon.com/blogs/aws/dynamodb-update-json-and-more/
* Spark SQL: https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

An example JSON message from http://json.org/example.html follows.

````
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
````


If you are working with data generated from the web, odds are that the data will follow the JSON format. Even the file format of this Jupyter notebook is JSON.

In data science circles, JSON is considered semi-structured data within this spectrum of structuredness:

* **Structured**: Structured data typically refers to the well-defined schemas of relational databases, where tables have predefined rows and columns.

* **Semi-Structured**: Semi-structured data has some organization for its elements, but does not dictate a defined structure as in a relational database's table. Data can be retrieved from semi-structured data using a key or label as in the Python dictionary. Other examples of semi-structured data include JSON and CSV.

* **Unstructured**: Unstructured data lacks any predefined schema or model. Examples of unstructured data might include audio files or unlabeled text scraped from the web.


**JSON in Python**

Conceptually, JSON is quite similar to Python's dictionary. In practice, however, JSON requires specific formatting to be valid and therefore interoperate with other systems. Python's built-in <code>json</code> library makes it quite easy to encode native Python datatypes to JSON and decode JSON back into them.

For encoding, <code>json.dumps(object)</code> is the primary function, as can be seen below.

In [None]:
import json

python_dict = {'class':'big data', 'favorite technologies':['Python', 'Spark', 'AWS', 'Jupyter', 'Google BigQuery']}


# Convert it to a JSON-formatted string
json.dumps(python_dict)

Look at the output of <code>json.dumps</code>.  How is it different from the dictionary object?  What is the return type of <code>json.dumps</code>?
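
As a sketch of a full round trip, the example below encodes a small made-up dictionary with <code>json.dumps</code> and decodes it again with <code>json.loads</code>, the string counterpart of <code>json.load</code>. The <code>record</code> contents are invented; <code>sort_keys=True</code> is used only to make the output order predictable.

```python
import json

record = {'name': 'example', 'scores': (1, 2, 3), 'active': True}

# Encode: Python objects become JSON text.
encoded = json.dumps(record, sort_keys=True)
print(encoded)

# Decode: JSON text becomes Python objects again.
decoded = json.loads(encoded)
print(decoded['scores'])
```

Notice that the round trip is not always exact: JSON has no tuple type, so the tuple comes back as a list, and Python's <code>True</code> is written as <code>true</code> in the JSON text.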

> A quick aside: to determine the type of any object empirically, use <code>type(object)</code>


The json library predictably helps you read JSON into native Python datatypes too. Read our sample file below and then determine the type of the resulting object.

In [None]:
with open('./datasets/sample.json') as json_file:
    data = json.load(json_file)

# What type is "data"


Now, print the contents of the file. For formatting, we will use the "pretty print" library to make the output nicer in our notebook: https://docs.python.org/2/library/pprint.html

In [None]:
from pprint import pprint

pprint(data)

Using the same general approach to selecting data, assign the "company" field from the first element of the <code>data</code> object to a variable in the next cell.

In [None]:
# What is the company name of the first element in data


## pandas Series, Dataframe, and Panel

We used a pandas dataframe in the last exercise, so let's explore data selection, filtering, and some of the built-in statistics functions in more detail for dataframes, series, and panels.

**Indexing a Series**

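Indexing a series uses the hopefully familiar <code>series[label]</code> format. The "label" for a series can be either the zero-based position or the label of the index. To be explicit about which one you mean, pandas also provides <code>series.iloc[position]</code> and <code>series.loc[label]</code>. Try the following.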

In [None]:
import pandas as pd
import math

# Make a series
my_series = pd.Series([42,'towel',math.pi], index=['one','two','three'])

# Print the series

# Return the 2nd element of the series by location

# Return the element called "three" by name

# What is the datatype of the "three" element
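
A brief sketch of positional and label-based selection on a separate, made-up series called <code>prices</code>. It uses the explicit <code>.iloc</code> and <code>.loc</code> accessors so the intent is unambiguous.

```python
import pandas as pd

prices = pd.Series([1.5, 2.25, 4.0], index=['tea', 'coffee', 'cocoa'])

by_position = prices.iloc[1]      # second element, counting from zero
by_label = prices.loc['cocoa']    # element whose index label is 'cocoa'
also_by_label = prices['tea']     # plain brackets with a label

print(by_position)
print(by_label)
print(also_by_label)
```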


**Boolean Masks**

pandas masks are similar to those in NumPy and take the form <code>series[series logical test]</code>, where the inner term returns a boolean mask that is then used to select the relevant elements from the outer series object.
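
A minimal sketch of this pattern on a hypothetical numeric series (<code>prices</code> is not the series used in the exercise):

```python
import pandas as pd

prices = pd.Series([1.5, 2.25, 4.0], index=['tea', 'coffee', 'cocoa'])

# Inner term: a boolean series with the same index as `prices`.
mask = prices > 2
print(mask)

# Outer term: keep only the rows where the mask is True.
expensive = prices[mask]
print(expensive)
```

The index labels travel with the selected values, so the result still shows which items passed the test.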

Try this below by first creating a logical test to return the boolean mask of items that are "towel". Refer to the Python comparison operators listed under comparisons: https://docs.python.org/2/library/stdtypes.html

In [None]:
# First, return a boolean vector of all the elements that are "towel"

# Now pass the vector to the series to return any that are True
