# Lab 2: Types of Data

So far, we've only used Python to manipulate numbers.  In this lab, you'll see how to represent and manipulate another fundamental type of data: text.  A piece of text is called a *string* in Python.

You'll also see how to invoke *methods*, which are functions that are specialized to a piece of data.

Last, you'll see how to work with datasets in Python -- collections of data like text or numbers.  In this class, we principally use two kinds of *collections*:
  * **Arrays:** An array is a collection of many pieces of one kind of data.  An array is like a single column in an Excel spreadsheet.
  * **Tables:** A table is a collection of many pieces of different kinds of data.  Each kind of data is in its own *column*.  A table is like an entire Excel spreadsheet.

For your later reference, here are the principal types of data we'll work with in this course.

|English name|Python name|Example|Example Python expressions|
|-|-|-|-|
|Number|`float` (numbers with decimals) or `int` (integers)|The number of words in a book|`2`, `.25`, `2+2`|
|Text|`str` (a string)|A word, chapter, or whole text of a book|`"I <3 Data Science"`|
|A collection of one kind of data|`array`|All the numerical grades on a homework assignment; all the letter grades in a class|`Table.read_table("course_data.csv").column("Grades")`|
|A collection of multiple kinds of data|`table`|The letter grades and all the project, midterm, and final exam scores in a class|`Table.read_table("course_data.csv")`|

The material in this lab corresponds roughly to the second and third weeks of the semester-long course.

# 1. Text
Programming doesn't just concern numbers. Text is one of the most common types of values used in programs. 

A snippet of text is represented by a *string value* in Python. The word "*string*" is a computer science term for a sequence of characters. A string might contain a single character, a word, a sentence, or a whole book. Strings have quotation marks around them. Single quotes (`'`) and double quotes (`"`) are both valid. The contents can be any sequence of characters, including numbers and symbols. 

We've seen strings before in `print` statements.  Below, two different strings are passed as arguments to the `print` function. 

In [19]:
print("I <3", 'Data Science')

`print` prints all of its arguments together, separated by single spaces.

Just like names can be given to numbers, names can be given to string values.  The names and strings aren't required to be similar. Any name can be given to any string.

**Question 1.1.** Replace each `...` with a single-word string literal below so that the final expression prints its punchline.

<img src="https://s-media-cache-ak0.pinimg.com/236x/1c/8f/cf/1c8fcf8892a14c474e77dcb108e7e8f8.jpg"/>

In [3]:
why_was = "Because"
six = ...
afraid_of = "eight"
seven = ...
print(why_was, six, afraid_of, seven)

In [4]:
# Run this cell after the one above to test your work (or get a hint).
from client.api.assignment import load_assignment 
lab02 = load_assignment('longlab02.ok')
_ = lab02.grade('q11')

## 1.1. String Methods

Strings can be transformed using *methods*, which are functions that involve an existing string and some other arguments. For example, the `replace` method replaces all instances of some part of a string with some replacement. A method is invoked on a string by placing a `.` after the string value, then the name of the method, and finally parentheses containing the arguments. 

    <string>.<method name>(<argument>, <argument>, ...)

Otherwise, a method is pretty similar to a function.

Try to predict the output of these examples, then execute them.

In [22]:
# Replace one letter
'Hello'.replace('o', 'a')

In [23]:
# Replace a sequence of letters, which appears twice
'hitchhiker'.replace('hi', 'ma')

In [24]:
# Two calls to replace
'train'.replace('t', 'ing').replace('in', 'de')

Once a name is bound to a string value, methods can be invoked on that name as well. The string bound to the name doesn't change when a method is invoked, so a new name is needed to capture the result. 

In [25]:
sharp = 'edged'
hot = sharp.replace('ed', 'ma')
print('sharp =', sharp)
print('hot =', hot)

**Question 1.1.1.** Assign strings to the names `you` and `this` so that the final expression evaluates to a 10-letter English word with three double letters in a row. You can run the tests for a hint.

In [26]:
you = ...
this = ...
a = 'beeper'
the = a.replace('p', you) 
the.replace('bee', this)

In [2]:
_ = lab02.grade('q111')

Other string methods do not take any arguments at all, because the original string is all that's needed to compute the result. In this case, parentheses are still needed, but there's nothing in between the parentheses. For example, 

    lower:      return a lowercased version of the string
    upper:      return an uppercased version of the string
    capitalize: return a version with the first letter capitalized
    title:      return a version with the first letter of every word capitalized

In [28]:
'unIverSITy of caliFORnia'.title()
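
As a rough sketch, the four no-argument methods listed above can be compared side by side on the same string.  The name `school` and the other names below are made up for this illustration.

```python
# A sample string with messy capitalization (the same one used in the cell above).
school = 'unIverSITy of caliFORnia'

lowered = school.lower()            # every letter lowercased
uppered = school.upper()            # every letter uppercased
capitalized = school.capitalize()   # first letter uppercased, the rest lowercased
titled = school.title()             # first letter of every word uppercased

print(lowered)
print(uppered)
print(capitalized)
print(titled)

# The original string is untouched; each method returned a new string.
print(school)
```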

Methods can also take arguments that aren't strings.  For example, strings have a method called `zfill` that "pads" them with the character `0` so that they reach a certain length.  This can be useful for displaying numbers in a uniform format:

In [29]:
print("5.12".zfill(6))
print("10.50".zfill(6))
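
Here is a small additional sketch of `zfill`, using made-up strings.  The argument is the total width of the result, not the number of zeros to add, so a string that is already long enough comes back unchanged.

```python
# zfill takes an int: the total width the padded string should have.
short_padded = "7".zfill(3)       # two zeros are added to reach width 3
long_padded = "1234".zfill(3)     # already longer than 3, so nothing is added
wide_padded = "42".zfill(5)       # three zeros are added to reach width 5

print(short_padded)
print(long_padded)
print(wide_padded)
```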

All these string methods are useful, but most programmers don't memorize their names or how to use them.  You can refer back to this lab for the ones we mention, or just Google search for something like "`how to pad a string in Python`".  The website Stack Overflow often has useful answers.

## 1.2. Converting to and from Strings

Strings and numbers are different *types* of values, even when a string contains the digits of a number. For example, evaluating the following cell causes an error because an integer cannot be added to a string.

In [30]:
8 + "8"

However, there are built-in functions to convert numbers to strings and strings to numbers. 

    int:   Converts a string of digits without a decimal point to a number
    float: Converts a string of digits, perhaps with a decimal point, to a number
    str:   Converts any value to a string
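
A minimal sketch of the three conversion functions in use.  The names `whole`, `decimal`, and `text` are made up for this illustration.

```python
whole = int("42")        # a string of digits becomes an int
decimal = float("3.5")   # a string with a decimal point becomes a float
text = str(2016)         # a number becomes a string

print(whole + 1)         # arithmetic works on the converted number
print(decimal * 2)
print(text + "!")        # + joins two strings end to end
```

Note that `int("3.5")` causes an error, because `int` expects a string of digits without a decimal point.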

Suppose you're writing a program that looks for dates in a text, and you want your program to find the difference between two years it has identified.  It doesn't make sense to subtract two texts, but you can first convert the text containing the years into numbers:

In [37]:
one_year = "1618"
another_year = "1648"

# We can't just write:
#   another_year - one_year
# If you don't see why, try writing that here and running it.
difference = int(another_year) - int(one_year)
difference

Try to predict what the following cell will evaluate to, then evaluate it.

In [31]:
int("20" + str(8 + int("8")))

## 1.3. Strings as function arguments
String values, like numbers, can be arguments to functions and can be returned by functions.  The function `len` takes a single string as its argument and returns the number of characters in the string: its **len**gth.
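
For instance, here is a short sketch with a made-up string.  Spaces and punctuation count as characters, and the empty string has length 0.

```python
greeting = "Hello, world!"
greeting_length = len(greeting)   # letters, the comma, the space, and the "!" all count
empty_length = len("")            # a string with no characters at all

print(greeting_length)
print(empty_length)
```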

**Question 1.3.1.**  Use `len` to find out the length of the very long string in the next cell.  (It's the first sentence of the French [Declaration of the Rights of Man and Citizen](http://avalon.law.yale.edu/18th_century/rightsof.asp).)  The length of a string is the total number of **characters** in it, including things like spaces and punctuation.  Assign `sentence_length` to that number.

In [3]:
a_very_long_sentence = "Les Représentans du Peuple François, constitués en Assemblée Nationale, considérant que l’ignorance, l’oubli ou le mépris des droits de l’Homme sont les seules causes des malheurs publics et de la corruption des Gouvernemens, ont résolu d’exposer, dans une Déclaration solemnelle, les droits naturels, inaliénables et sacrés de l’Homme, afin que cette Déclaration, constamment présente à tous les Membres du corps social, leur rappelle sans cesse leurs droits et leurs devoirs ; afin que les actes du pouvoir législatif, et ceux du pouvoir exécutif pouvant à chaque instant être comparés avec le but de toute institution politique, en soient plus respectés ; afin que les réclamations des Citoyens, fondées désormais sur des principes simples et incontestables, tournent toujours au maintien de la Constitution, et au bonheur de tous."
sentence_length = ...
sentence_length

In [35]:
_ = lab02.grade('q131')

# 2. Importing code

> What has been will be again,  
> what has been done will be done again;  
> there is nothing new under the sun.

Most programming involves work that is very similar to work that has been done before.  Since writing code is time-consuming, it's good to rely on others' published code when you can.  Rather than copy-pasting their code into ours, Python allows us to *import* it, creating a *module* that contains all of the names created by that code.

Python includes many useful modules that are just an `import` away.  We'll look at the `math` module as a first example.

Suppose we want to very accurately compute the area of a circle with radius 5 meters.  For that, we need the constant $\pi$, which is roughly 3.14.  Conveniently, the `math` module defines `pi` for us:

In [4]:
import math
radius = 5
area_of_circle = radius**2 * math.pi
area_of_circle

`pi` is defined inside `math`, and the way that we access names that are inside modules is by writing the module's name, then a dot, then the name of the thing we want:

    <module name>.<name>
    
In order to use a module at all, we must first write the statement `import <module name>`.  That statement creates a module object containing the names the module defines and then assigns the module's name to that object.  Above we have done that for `math`, whose module object contains names like `pi`.

`math` also provides the name `e` for the base of the natural logarithm, which is roughly 2.72.  Here we've computed $e^{\pi}-\pi$, giving it the name `near_twenty`.

In [39]:
near_twenty = math.e ** math.pi - math.pi
near_twenty

![XKCD](http://imgs.xkcd.com/comics/e_to_the_pi_minus_pi.png)

## 2.1. Importing functions
Modules can provide other named things, including functions.  For example, `math` provides the name `sin` for the sine function.  Having imported `math` already, we can write `math.sin(3)` to compute the sine of 3.  (Note that this sine function considers its argument to be in [radians](https://en.wikipedia.org/wiki/Radian), not degrees.  180 degrees is equivalent to $\pi$ radians.)
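
As an illustration of the radians convention, here is a sketch that converts an angle given in degrees before taking its sine.  It uses `math.radians`, a function from the `math` module that this lab doesn't otherwise cover; the names `angle_in_degrees` and `angle_in_radians` are made up.

```python
import math

angle_in_degrees = 90
# math.radians multiplies by pi/180 to convert degrees to radians.
angle_in_radians = math.radians(angle_in_degrees)
sine_of_90_degrees = math.sin(angle_in_radians)

print(angle_in_radians)      # the same value as math.pi / 2
print(sine_of_90_degrees)
```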

Here we've computed the sine of $\frac{\pi}{4}$ using `sin` and `pi` from the `math` module.

<img src="http://mathworld.wolfram.com/images/eps-gif/TrigonometryAnglesPi4_1000.gif">
(Source: [Wolfram MathWorld](http://mathworld.wolfram.com/images/eps-gif/TrigonometryAnglesPi4_1000.gif))

In [41]:
sine_of_pi_over_four = ...
sine_of_pi_over_four

For your reference, here are some more examples of functions from the `math` module:

In [43]:
# Calculating factorials.
math.factorial(5)

In [44]:
# Calculating logarithms (the logarithm of 8 in base 2).
math.log(8, 2)

In [45]:
# Calculating square roots.
math.sqrt(5)

In [46]:
# Calculating cosines.
math.cos(math.pi)

##### A function that displays a picture
People have written Python functions that do very cool and complicated things, like crawling web pages for data, transforming videos, or doing machine learning with lots of data.  Now that you can import things, when you want to do something with code, first check to see if someone else has done it for you.  Let's see an example of a function that's used for downloading and displaying pictures.

The module `IPython.display` provides a function called `Image`.  `Image` takes a single argument, a string that is the URL of the image on the web.  It returns an *image* value that this Jupyter notebook understands how to display.  To display an image, make it the value of the last expression in a cell, just like you'd display a number or a string.

In the next cell, we've imported the module `IPython.display` and used its `Image` function to display the image at this URL, which depicts the death of Socrates:

https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/David_-_The_Death_of_Socrates.jpg/1024px-David_-_The_Death_of_Socrates.jpg

In [4]:
import IPython.display
art = IPython.display.Image('https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/David_-_The_Death_of_Socrates.jpg/1024px-David_-_The_Death_of_Socrates.jpg')
art

# 3. Arrays
Up to now, we haven't done much that you couldn't do yourself by hand or with a calculator.  Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.  For example, in the time it takes you to calculate an 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

Arrays are how we put many values in one place so that we can operate on them as a group.  For example, if `billions_of_numbers` is an array of numbers, the expression `0.18 * billions_of_numbers` gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by 0.18.  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

We've created a companion package of functions for working with large datasets.  It's called `datascience`.  It's typically imported with:

    from datascience import *

...which is just a slightly different way of importing a module.  Run the next cell to import both.

In [38]:
# Just run this cell.
import numpy as np
from datascience import *
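
The import styles used above can be compared using the standard `math` module as a stand-in, so this sketch runs even without `numpy` or `datascience` installed.  The short name `m` is an arbitrary choice made for this illustration.

```python
# Style 1: import the module under a shorter name, like `import numpy as np`.
import math as m
area = m.pi * 5 ** 2          # names are reached through the short module name

# Style 2: bring specific names in directly, so no module prefix is needed.
from math import sqrt, pi
side = sqrt(16)
same_area = pi * 5 ** 2

print(area)
print(side)
print(area == same_area)
```

Writing `from <module name> import *` works like the second style, except that it brings in every public name from the module at once.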

You can type in an array yourself, but that's not typically how programs work.  Normally, we create arrays by loading them from an external source, like a data file.
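
For a quick experiment, typing in an array might look like the sketch below.  It uses NumPy's `np.array` function on a list of values; the bill amounts are made up for this illustration.

```python
import numpy as np

# Hypothetical restaurant bills, typed in by hand.
small_bills = np.array([20.0, 35.5, 12.0])

# Multiplying the array by a number multiplies every element.
small_tips = 0.18 * small_bills

print(small_bills)
print(small_tips)
```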

The next cell loads an array containing the world population in each year from 1950 to roughly the present.  You'll learn what a Table is soon.  (The estimates come from the [US Census Bureau website](http://www.census.gov/population/international/data/worldpop/table_population.php).  You can check out the data file [here](world_population.csv) if you want.)

In [7]:
world_population = Table.read_table("world_population.csv").column("Population")
world_population

We can get individual numbers out of our array by calling the `item` method on it.

In [None]:
population_1950 = world_population.item(0)
population_1950

In [None]:
population_1962 = world_population.item(12)
population_1962

Notice that, to get the first population value in our array, we asked for item "0", not item 1.  In Python, and in many programming languages, a thing in an array is referred to by its "index," and indices start from 0, not 1.

This is often confusing for new programmers.  A good way to remember it is that the index of an item is the number of things before it in an array.

Fortunately, since arrays are primarily useful for doing the same operation many times, we don't often have to work with single elements anyway.
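
To make 0-based indexing concrete, here is a sketch with a tiny made-up array of strings.  The index of each item equals the number of items before it.

```python
import numpy as np

letters = np.array(["a", "b", "c", "d"])

first_letter = letters.item(0)   # 0 items come before "a"
third_letter = letters.item(2)   # 2 items come before "c"
last_letter = letters.item(3)    # the array has 4 items, so the last index is 3

print(first_letter, third_letter, last_letter)
```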

## 3.1. Doing something with every element of an array
##### Logarithms
Here is one simple question we might ask about world population:

> How big is the population in *orders of magnitude*?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.

We could try to answer our question like this, using the `log10` function from the `math` module:

In [39]:
population_1950_magnitude = math.log10(world_population.item(0))
population_1951_magnitude = math.log10(world_population.item(1))
population_1952_magnitude = math.log10(world_population.item(2))
population_1953_magnitude = math.log10(world_population.item(3))
...

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

In [9]:
population_magnitudes = np.log10(world_population)
population_magnitudes

<img src="array_logarithm.jpg">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
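
Elementwise application is easy to check by hand on a small made-up array.  Each power of 10 below has a base-10 logarithm equal to its number of zeros.

```python
import numpy as np

powers_of_ten = np.array([1, 10, 100, 1000])

# One call takes the logarithm of every element separately.
magnitudes = np.log10(powers_of_ten)

print(magnitudes)
print(len(magnitudes) == len(powers_of_ten))   # the result has the same length
```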

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [10]:
population_in_billions = world_population / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`).  For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [14]:
restaurant_bills = Table.read_table("restaurant_bills.csv").column("Bill")
print(restaurant_bills)
tips = .2 * restaurant_bills
print(tips)

<img src="array_multiplication.jpg">

You can also do arithmetic on many *pairs* of numbers at once, if they're in two arrays of the same length.  For example, to compute the 3 bill totals, we can add each bill and tip together:

In [13]:
total_bills = restaurant_bills + tips

<img src="array_addition.jpg">

To include a 9.5% sales tax and a flat $2 delivery charge, we could perform array arithmetic several times:

In [15]:
tax_rate = .095
delivery_charge = 2
total_bills_with_tax = restaurant_bills + tips + tax_rate*restaurant_bills + delivery_charge
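
Since the contents of `restaurant_bills.csv` aren't shown here, the same arithmetic can be traced on made-up numbers.  This sketch assumes three hypothetical bills and uses names prefixed with `sample_` to keep them apart from the names above.

```python
import numpy as np

sample_bills = np.array([10.0, 20.0, 40.0])   # hypothetical bill amounts

sample_tips = 0.2 * sample_bills              # a number times an array: each bill times 0.2
sample_totals = sample_bills + sample_tips    # array plus array: matching elements are added

# Tax is a fraction of each bill; the flat charge of 2 is added to every element.
sample_totals_with_tax = sample_bills + sample_tips + 0.095 * sample_bills + 2

print(sample_tips)
print(sample_totals)
print(sample_totals_with_tax)
```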

## 3.2. Making a range of numbers
Very often in data science, we want to work with many numbers that are evenly spaced.  For example, suppose we want to label each element of `world_population` with the year it came from.  The years ranged from 1950 to 2015 at intervals of 1, so we would write this:

In [40]:
years = np.arange(1950, 2016, 1)
years

`np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.  So the second argument to `arange` was 2016, *not* 2015, because we want the numbers to stop right before 2016.  (Like 0-based indexing, this can be confusing.)

More commonly, we just want an array of numbers in order, say the first 1000 numbers.  `np.arange(1000)` is shorthand for `np.arange(0, 1000, 1)`, so it's useful for that purpose.
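
A few small calls show how the arguments behave.  In each case the stopping value itself is left out of the result.  The names below are made up for this illustration.

```python
import numpy as np

first_five = np.arange(5)                # starts at 0, counts by 1, stops before 5
evens = np.arange(0, 10, 2)              # counts by 2, stops before 10
three_years = np.arange(1950, 1953, 1)   # 1953 is not included

print(first_five)
print(evens)
print(three_years)
```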

##### Perfect squares
For example, suppose we want to know the first 100 perfect squares. (A perfect square is a number that is equal to some integer times itself, like 4 ($2 \times 2$) or 169 ($13 \times 13$).)  Instead of checking a bunch of numbers to see if they're perfect squares, an easy way to do this is to square all the numbers from 0 to 99.  We can use `np.arange` and elementwise exponentiation `**` to do that:

In [None]:
first_100_numbers = np.arange(100)
first_100_perfect_squares = first_100_numbers**2
first_100_perfect_squares

For a challenge, try computing the first 10 *powers of 2*, which are 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512.  *Hint:* Equivalently, we could have written, in Python, `2**0`, `2**1`, `2**2`, `2**3`, etc.

In [41]:
first_10_powers_of_2 = ...
first_10_powers_of_2

## 3.3. Arrays of other things
An array can contain any one kind of data, not just numbers.  In particular, when we work with text data, we often use arrays of strings.

For example, if we want to work with each chapter of *Little Women*, we can make the text of each chapter a separate element of an array:

In [48]:
little_women_chapters = Table.read_table("little_women_chapters.csv").column("Chapter Text")
little_women_chapters

We can use elementwise functions to perform operations on many strings at once.  Let's count how many times Jo appears in each chapter:

In [49]:
jo_counts = np.char.count(little_women_chapters, "Jo")
jo_counts

NumPy provides some nice functions for manipulating strings this way, but often it doesn't have exactly what you want.  It's more common to define your own function and apply it to every string in an array.  You'll see how to do that later.
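
To see what `np.char.count` does without loading the whole book, here is a sketch on three made-up sentences.  The count is case-sensitive and matches the text anywhere it appears, so it would also count the "Jo" inside a word like "John".

```python
import numpy as np

# Hypothetical sentences, not taken from the book.
sentences = np.array(["Jo wrote to Meg.", "Meg and Amy read.", "Jo, Jo, Jo!"])

# Count the occurrences of "Jo" in each string separately.
sample_jo_counts = np.char.count(sentences, "Jo")

print(sample_jo_counts)
```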

# 4. Tables
Arrays are useful for holding and working with one kind of data about many entities.  For example, `world_population` contains the world population for each year from 1950 to 2015, and `years` contains the years in which the measurements were taken, 1950 to 2015.  But what if we want to ask a question that combines those two pieces of information, like:

> In what year did the world population cross 6 billion?

Tables let us answer this kind of question by associating several arrays of information about the same entities.

## 4.1. Making a table
Let's put the world populations together with their years:

In [30]:
population_table = Table().with_column("Year", years).with_column("Population", world_population)
population_table

To make that table, we:
1. Made an empty table by calling the function `Table`.
2. Extended that table with the column of years, named "Year".  We did this by calling `with_column` on the empty table.
3. Extended that table with the column of populations, named "Population".  We did this by calling `with_column` on the table with only the Year column.

Usually, though, we load tables directly from a data file.  The function `Table.read_table` does that.

Let's load data about some highly rated movies, provided by imdb.com:

In [31]:
imdb = Table.read_table("imdb_ratings.csv")
imdb

If we want just the ratings of the movies, we can get an array that contains the data in that column:

In [32]:
imdb.column("Rating")

We could find the rating of the highest-rated movie using just that array, like this:

In [33]:
highest_rating = max(imdb.column("Rating"))
highest_rating

But that's not very useful; you'd probably want to know the *name* of the highest-rated movie.  To do that, we can sort the table by rating.

In [35]:
imdb.sort("Rating")

Well, that actually doesn't help much -- now we know the lowest-rated movies.  To look at the highest-rated movies, sort in reverse order:

In [36]:
imdb.sort("Rating", descending=True)

So the highest-rated movies in the dataset are *The Shawshank Redemption* and *The Godfather*.

Tables provide many other kinds of functionality that we'll explore throughout this class.  They're documented [here](http://datascience.readthedocs.io/en/master/), and their source code is [here](https://github.com/data-8/datascience/tree/master/datascience), if you're interested.