# Please go to https://ccv.jupyter.brown.edu

## What we learned so far... 
- Why programming is useful
- Decide if a task is a general coding task or a machine learning task
- Basic python: variables, control flow and loops

# 2. Container Types
### By the end of the day you'll be able to
- Know which container type to use for various programming tasks
- Convert/"cast" between container types
- Always read the help() to understand new Python code
- Read and debug error tracebacks

* Containers are objects that "hold" other objects
* Useful for storing, accessing, and iterating over an arbitrary number of objects
* Container types include `list`, `array`, `dict`, `tuple`, `set`, `DataFrame`

## 2.1 The list Type
* Can store arbitrary elements
* A single list can store a mix of different types
* One of the most-often used objects in Python

In [None]:
foo = []
foo

In [None]:
groceries = ["milk", "bread", "apples", "eggs"]
groceries

In [None]:
prime_nums = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

In [None]:
stats = ['ashley', 9.9, 'providence', True]

### 2.1.1 Indexing a list
Behaves much like indexing a string

In [None]:
prime_nums

In [None]:
len(prime_nums)     # get length of list

In [None]:
prime_nums[0]       # get first element

In [None]:
prime_nums[-1]      # get last element

### 2.1.2 Slicing

Just as we can use square brackets to access individual list elements, we can also use them to access sublists with the slice notation, marked by the colon (:) character. 

`x[start:stop:step]`

If any of these are unspecified, they default to the values: 
- `start=0`
- `stop=size of dimension`
- `step=1`

In [None]:
prime_nums[0:3]     # get first three elements

In [None]:
prime_nums[2:]      # get third element to end of list

In [None]:
prime_nums[:3]      # get first element to third element

In [None]:
prime_nums[::-1]    # reverse a list

In [None]:
prime_nums[::2]     # every other element
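The three slice parts can also be combined. The sketch below reuses the `prime_nums` list from above; the names `every_third`, `last_three`, and `too_far` are only illustrative.

```python
prime_nums = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

# start at index 1, stop before index 8, take every third element
every_third = prime_nums[1:8:3]
print(every_third)

# negative indices count from the end of the list
last_three = prime_nums[-3:]
print(last_three)

# a slice that runs past the end does not raise an error
too_far = prime_nums[:100]
print(too_far)
```

Unlike indexing a single element, slicing never raises an `IndexError`; it returns whatever elements fall inside the requested range.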

Lists can also contain other list values. 

In [None]:
a_pile = [['cat', 'bat'], [10, 20, 30, 40, 50]]

a_pile[0]

## Exercise 1

Access and print the second value in the first list in `a_pile`.

`a_pile = [['cat', 'bat'], [10, 20, 30, 40, 50]]`

In [None]:
# Solution



### 2.1.2 Modifying list Objects
* List objects are mutable
* We can add, remove, or change elements

In [None]:
names = ["john", "paul", "george", "ringo"]

In [None]:
names.append("yoko")
names

In [None]:
names.remove("george")
names

In [None]:
names.insert(1, "freddie")
names

In [None]:
names[3] = "pete"    # we can modify with assignment
names

### 2.1.3 List Concatenation and Replication

The `+` operator can combine two lists to create a new list value in the same way it combines two strings into a new string value. The `*` operator can also be used with a list and an integer value to replicate the list.

In [None]:
[1, 2, 3] + ['A', 'B', 'C']

In [None]:
[1, 2, 3] * 3

### 2.1.4 List Containment

In [None]:
names

In [None]:
"paul" in names

In [None]:
9 in [5, 4, 3]

## Exercise 2

Using the methods we covered, update last week's grocery list for your current week of shopping.

* last week: `['milk', 'bread', 'eggs', 'cheese', 'cereal', 'apples']`
* this week: `['milk', 'bread', 'yogurt', 'cereal', 'oranges', 'hummus']`

In [None]:
# Solution



### 2.1.5 Sorting Lists

Lists of number values or lists of strings can be sorted with the sort() method.

In [None]:
spam = [2, 5, 3.14, 1, -7]
spam.sort()
spam

In [None]:
spam = ['ants', 'cats', 'dogs', 'badgers', 'elephants']
spam.sort()
spam

In [None]:
spam.sort(reverse=True)
spam

You cannot sort lists that contain both numbers and strings, because Python doesn't know how to compare these values. Python will raise a `TypeError`.


In [None]:
spam = [1, 3, 2, 4, 'Kallie', 'Alice', 'Bob']
spam.sort()
spam
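To read the error message without stopping the notebook, the failing call can be wrapped in `try`/`except`. This is a small sketch that goes slightly beyond what we have covered so far; the names `mixed` and `message` are illustrative.

```python
mixed = [1, 3, 2, 4, 'Kallie', 'Alice', 'Bob']

try:
    mixed.sort()
except TypeError as err:
    # the message names the two types Python could not compare
    message = str(err)
    print("TypeError:", message)
```

The last line of a traceback carries the same information: the error type followed by a short description of what went wrong.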

## Exercise 3

How would you sort the list, spam, such that sorted numbers appear before sorted letters?

`spam = [1, 3, 2, 4, 'Alice', 'Bob']` to `spam = [1, 2, 3, 4, 'Alice', 'Bob']`

In [None]:
# Solution




`sort()` uses “ASCIIbetical order” rather than actual alphabetical order for sorting strings. This means uppercase letters come before lowercase letters. Therefore, the lowercase a is sorted so that it comes after the uppercase Z. 

In [None]:
spam = ['a', 'z', 'A', 'Z']
spam.sort()
spam
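The ordering above follows the numeric code point of each character, which the built-in `ord()` function reveals. A short sketch (the names `letters` and `ordered` are illustrative):

```python
letters = ['a', 'z', 'A', 'Z']

# ord() returns the integer code point of a single character
for letter in letters:
    print(letter, ord(letter))

# sorted() returns a new sorted list and leaves `letters` unchanged
ordered = sorted(letters)
print(ordered)
```

Every uppercase letter has a smaller code point than every lowercase letter, which is why `'Z'` sorts before `'a'`.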

If you need to sort the values in regular alphabetical order, pass `str.lower` for the `key` keyword argument in the `sort()` method call. This causes the `sort()` method to treat all the items in the list as if they were lowercase without actually changing the values in the list.

In [None]:
spam = ['a', 'z', 'A', 'Z']
spam.sort(key=str.lower)
spam

### 2.1.6 Going back to strings...

In [None]:
string = "There was no possibility of taking a walk that day."
words = string.split()
words

In [None]:
' '.join(words)

In [None]:
' (:D) '.join(words)

## 2.2 The array Type
* A single array can store 1 type of element
* Can't have mixed type arrays
* This is different from list objects
* Can take on any number of dimensions, but 2D is most common

### 2.2.1 Packages in Python
* Part of NumPy package
* Python's strength is its ecosystem of packages
* A package is just a set of functions and objects

Some useful Python packages from the Python Standard Library:
- math
- os
- sys
- re
- pickle
- time
- datetime

(More here: <https://docs.python.org/3/library/>)

Other useful Python packages:
- numpy
- pandas
- scipy
- matplotlib
- requests
- nltk
- SQLAlchemy
- sklearn

### 2.2.2 NumPy Package
* de facto standard for numerical problems
* Fast!!!
* Lots of functions for linear algebra
* Exceptionally popular

### 2.2.3 Importing a package

In [None]:
import numpy
a = numpy.array([7, 54, 3, 3, 6, 10, 9])   # assign the array so that `a` is defined
a

In [None]:
import numpy as np       # this is the most commonly used convention for numpy
a = np.array([7, 54, 3, 3, 6, 10, 9])  
a

In [None]:
from numpy import array
a = array([7, 54, 3, 3, 6, 10, 9])
a

### 2.2.4 Array Basics

In [None]:
c = np.arange(15)
c

In [None]:
help(np.arange)

In [None]:
c = c.reshape(3, 5)
c

In [None]:
# the number of axes (dimensions) of the array
c.ndim

In [None]:
# the dimensions of the array
c.shape

In [None]:
# the total number of elements of the array
c.size

In [None]:
# an object describing the type of the elements in the array
c.dtype

You can specify a dtype using standard Python types. NumPy also provides types of its own; `numpy.int32`, `numpy.int16`, and `numpy.float64` are some examples.
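As a minimal sketch of how dtypes behave, the example below lets NumPy pick a dtype and then requests one explicitly with the `dtype` keyword. The variable names are illustrative.

```python
import numpy as np

# mixed ints and floats are all stored as floats, since an array holds one type
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)

# request a NumPy type with the dtype keyword
small_ints = np.array([1, 2, 3], dtype=np.int16)
print(small_ints.dtype)

# a standard Python type also works as a dtype
as_floats = np.array([1, 2, 3], dtype=float)
print(as_floats)
```

This is the practical consequence of an array holding a single element type: values are converted to a common dtype when the array is created.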

### 2.2.5 Indexing

In [None]:
a

In [None]:
a[0]             # array indexing is similar to list indexing

In [None]:
a[::-1]         # reverse an array

In [None]:
c

In [None]:
c[2, -1]        # index the value in the third row and last column

In [None]:
c[0, :]         # access a single row

In [None]:
c[:, 0]         # access a single column

### 2.2.6 Computation with Arrays 

In [None]:
M = np.random.randint(0, 10, (3, 4))
M

In [None]:
M.sum()          # sum all elements in array

In [None]:
M.min(axis=0)    # column mins

In [None]:
M.max(axis=1)    # row maxes

In [None]:
M * 2            # elementwise multiplication

## Exercise 5

Get the row sums for the following array:

`foo = np.random.randint(0, 10, (5, 4))`

In [None]:
# Solution



## 2.3 The dict Type
* Different from list and array
* Stores "key" and "value" pairs
* Can store mixed types
* Fast lookup table

### 2.3.1 Creating dict

In [None]:
phone_nums = {}

In [None]:
phone_nums["jones"] = "555-555-555"

In [None]:
phone_nums["lee"] = "555-444-333"

In [None]:
phone_nums

In [None]:
phone_nums["jones"]        # "jones" and "lee" are keys

In [None]:
cities = {'BOS': 'Boston',
          'NYC': 'New York City',
          'LAX': 'Los Angeles'}
cities

### 2.3.2 Adding Entries of Arbitrary Type

In [None]:
phone_nums[16] = 'abc'

In [None]:
phone_nums

In [None]:
phone_nums[16]

In [None]:
phone_nums['smith'] = ['999-999-9999', '1-800-666-6666']

In [None]:
phone_nums

In [None]:
phone_nums['smith'][1]

### 2.3.3 Dictionary methods

In [None]:
cities.keys()

In [None]:
cities.values()

In [None]:
cities.items()

In [None]:
if 'NYC' in cities:
    print('yes, NYC is a key in the cities dictionary')
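The methods above are often used in a loop. The sketch below iterates over key/value pairs and also shows `get()`, which looks up a key without raising a `KeyError` when the key is missing. The `'PVD'` code and the `'unknown'` default are made up for illustration.

```python
cities = {'BOS': 'Boston',
          'NYC': 'New York City',
          'LAX': 'Los Angeles'}

# items() yields one (key, value) pair per entry
for code, name in cities.items():
    print(code, '->', name)

# get() returns the default instead of raising KeyError for a missing key
missing = cities.get('PVD', 'unknown')
print(missing)
```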

## 2.4 The `tuple` Type
 
- Store mixed-type data
- Similar to `list` type, but immutable
- Tuples can be slightly faster to create and use a little less memory than lists
    - This can be important if you are working with large amounts of data, like a text corpus of millions of documents

In [None]:
a = ('hello', 'world', 'goodbye', 1, 2, 3)
type(a)

In [None]:
b = 'hello', 'world', 'goodbye', 1, 2, 3
type(b)

### 2.4.1 Indexing and Slicing a `tuple`

In [None]:
a[0]                # get first element

In [None]:
a[::2]              # get every other
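Immutability means that item assignment, which works on a list, fails on a tuple. A small sketch (the names `greeting`, `error_message`, and `updated` are illustrative):

```python
greeting = ('hello', 'world', 1, 2)

try:
    greeting[0] = 'goodbye'      # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# to "change" a tuple, build a new one instead
updated = ('goodbye',) + greeting[1:]
print(updated)
```

The original tuple is left untouched; `updated` is a separate object.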

## 2.5 The `set` Type

- Sets store unique elements
- Can store mixed type data
- Support set operations (e.g., union, intersection, difference)
- Fast!!!

In [None]:
# Can be constructed using set() or {...}; note that an empty {} creates a dict, not a set

x = set([5, 4, 3, 5])
print(x)

y = {5, 6, 3, 2}
print(y)

In [None]:
x - y     # in x but not in y

In [None]:
y - x     # in y but not in x

In [None]:
x | y     # union (in x or y)

In [None]:
x & y     # intersection (in x and y)

In [None]:
x ^ y     # in x or y, not both (symmetric difference)
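Because sets keep only unique elements, converting a list to a set is a quick way to drop duplicates. A minimal sketch with a made-up shopping cart:

```python
cart = ['milk', 'bread', 'milk', 'eggs', 'bread']

unique_items = set(cart)         # duplicates are dropped
print(len(cart), len(unique_items))

print('eggs' in unique_items)    # membership test

# sets are unordered, so sort when a stable order is needed
print(sorted(unique_items))
```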

## Exercise 6

Suppose we have the two documents, chopped into lists of words, below. Use set operations to get the unique words that appear in both documents.

```
doc1 = ["rivera", "took", "care", "of", "his", "beloved", "kahlo", "and", "she", "took", "care", "of", "him"]
doc2 = ["kahlo", "and", "rivera", "were", "beloved", "artists"]
```

In [None]:
# Solution




## 2.6 The `DataFrame` Type

* Part of Pandas package
* Spreadsheet or table-like representation of data
* de facto standard for machine learning problems
* Like a 2D array, but can store mixed types
* Built on numpy arrays
* Lots of functions for cleaning and massaging data, grouping, aggregations, plotting
* Exceptionally popular

![title](./panda.jpg)

### 2.6.1 Reading a CSV file in as a `DataFrame`

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('./usnewshealth.txt', sep='|', header=None, names=['ID', 'Date', 'Tweet'])
df.head()

In [None]:
help(pd.read_csv)

### 2.6.2 Basic Operations

In [None]:
df.head(10)      # view top rows

In [None]:
df.tail()     # view bottom rows

In [None]:
df.columns     # display the columns

In [None]:
list(df)       # display the columns as a list

In [None]:
df.shape       # display the dimensions

In [None]:
df.dtypes

In [None]:
df.describe()   # quick statistical summary of the numeric columns in your data

### 2.6.3 Sorting `DataFrames`

In [None]:
df.sort_values(by='Tweet').head(10)                # sort by values in the Tweet column

### 2.6.4 Selection

#### Getting Row/Column slices

In [None]:
df['Tweet']    # slice a column

In [None]:
df.Tweet         # alternate notation for slicing a column

In [None]:
df[['ID','Tweet']]         # slice two columns

#### Selection by Position using `.iloc[]`

In [None]:
df.iloc[0:5, :]               # slice rows explicitly

In [None]:
df.iloc[:, 1:3]               # slice columns explicitly

#### Selection by Boolean Indexing

In [None]:
df.Tweet.str.contains('taxes')    # create a boolean mask

In [None]:
df[df.Tweet.str.contains('taxes')]    # select data using boolean mask
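The same two steps can be tried without the tweet file. The sketch below uses a tiny, made-up DataFrame (`tweets`) standing in for the real data.

```python
import pandas as pd

# a tiny, made-up DataFrame standing in for the tweet data
tweets = pd.DataFrame({'ID': [1, 2, 3],
                       'Tweet': ['new taxes proposed',
                                 'flu season tips',
                                 'taxes and health care']})

mask = tweets.Tweet.str.contains('taxes')   # one True/False value per row
print(mask)

matches = tweets[mask]                      # keep only rows where the mask is True
print(matches)
```

The mask is itself a Series of booleans, so it can be stored, inspected, or combined with other masks before it is used for selection.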

### 2.6.5 String Methods

A Series (a pandas column) has a set of string processing methods in its `str` attribute that make it easy to operate on each element of the Series.

In [None]:
s = pd.Series(['Aardvark', 'Bear', 'Cat', 'dog', 'cat', 'LION'])
s

In [None]:
s.str.lower()

In [None]:
s.str.lower().replace(r'[aeiou]', '-', regex=True)

In [None]:
df.Tweet.str.lower()

### 2.6.6 Joining Dataframes
- Can do database style joins (inner, outer, left, right) to merge two or more dataframes
    - inner join keeps values in common (intersection)
    - outer join keeps all values (union)
    - left join keeps values in left dataframe
    - right join keeps values in right dataframe

In [None]:
left_df = df.iloc[0:10, :]
left_df = left_df.assign(Category=['positive']*5 + ['negative']*4 + ['very positive'])
left_df

In [None]:
right_df = pd.DataFrame({'Category': ['positive', 'negative', 'neutral'],
                         'Weight': [1, -1, 0]})
right_df

In [None]:
help(pd.merge)

In [None]:
innerjoin_df = pd.merge(left_df, right_df, how='inner', on='Category')
innerjoin_df

In [None]:
outerjoin_df = pd.merge(left_df, right_df, how='outer', on='Category')
outerjoin_df

In [None]:
leftjoin_df = pd.merge(left_df, right_df, how='left', on='Category')
leftjoin_df

In [None]:
rightjoin_df = pd.merge(left_df, right_df, how='right', on='Category')
rightjoin_df
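The four join types can also be compared side by side on data small enough to check by hand. This sketch uses two made-up DataFrames (`left` and `right`) and records how many rows each join keeps.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3],
                     'Category': ['positive', 'negative', 'very positive']})
right = pd.DataFrame({'Category': ['positive', 'negative', 'neutral'],
                      'Weight': [1, -1, 0]})

row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, how=how, on='Category')
    row_counts[how] = len(merged)    # number of rows kept by this join
    print(how, len(merged))
```

Only `'positive'` and `'negative'` appear in both tables, so the inner join keeps the fewest rows and the outer join the most. Rows without a match get `NaN` in the columns that come from the other table.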

## 2.7 Casting from one container type to another

In [None]:
a = ['hello', 'hi', 'hey']

# cast list to tuple
b = tuple(a)
b

In [None]:
# cast tuple to list
c = list(b)
c

In [None]:
# cast list to array
b = np.asarray(a)
b

In [None]:
# cast array to list
c = b.tolist()
c

In [None]:
# cast array to dataframe
L = pd.DataFrame(M, columns=['a','b','c','d'])
L

In [None]:
# cast dataframe to array
K = L.values
K

In [None]:
# cast lists to dictionary
codes = ['NYC', 'BOS', 'LAX', 'PHL', 'WAS']
cities = ['new york', 'boston', 'los angeles', 'philadelphia', 'washington dc']

d = dict(zip(codes, cities))
d

In [None]:
# cast dictionaries to dataframe
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
d = pd.DataFrame.from_dict(data)
d

In [None]:
# cast dataframe to dictionary
print(d.to_dict())

# Recap
- There are many Python container types and they are variably suited to different programming tasks
- You can cast container types to different container types (ex. lists to arrays)
- Always read the help() to understand new Python code
- Read and debug error tracebacks, starting at the line where your code raised the error.