# Python Data Structures: Lists, Dictionaries, and Data Frames

**Learning Objectives**
* Introduce some major data structures in Python: lists, dictionaries, and data frames.
* Practice interacting with and manipulating these data structures.


Data structures are objects that organize and store data in a useful way. They're a bedrock of data analysis in programming. We'll cover three of the fundamental data structures in this lesson: lists, dictionaries, and data frames.


## Lists: Ordered Data Structures

The first data structure we consider is a **list**. A list is a collection of **ordered** items. Lists have a length, and the constituent items can be **indexed** based on their positions.

They're most useful for storing a collection of values whose order matters. One nice thing about lists is that they can contain different types of data. For example, the entries of a list can be integers, floats, strings, and even other lists!

We specify a list with square brackets: `[]`

In [2]:
country_list = ["Afghanistan", "Canada", "Thailand", "Denmark", "Japan"]
type(country_list)

list

In [4]:
len(country_list)

5
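
As a small illustration of the point that lists can mix data types, the sketch below builds a list holding an integer, a float, a string, and another list. The name `mixed_list` and its values are made up for this example and are not part of the lesson data.

```python
# Hypothetical example list: the name and values are illustrative only
mixed_list = [42, 3.14, "Denmark", ["nested", "list"]]

# Each entry keeps its own type
print(type(mixed_list[0]))  # <class 'int'>
print(type(mixed_list[1]))  # <class 'float'>
print(type(mixed_list[2]))  # <class 'str'>
print(type(mixed_list[3]))  # <class 'list'>

# The nested list counts as a single item, so the length is 4
print(len(mixed_list))
```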

You can index a list using square brackets following the list name. A single index, as in `[i]`, returns one entry. The notation `[start:stop]`, called a **slice**, returns all entries between the two indices. If one side of the colon is empty, the slice runs from the start or to the end of the list, respectively.

Python is *zero*-indexed, meaning the first entry has index zero, not one! In addition, the `stop` index indicates 'up to but not including'. So, in `list[start:stop]`, `list[stop]` is not included.

In [5]:
print(country_list[0])
print(country_list[1:4])
print(country_list[1:])
print(country_list[:4])

Afghanistan
['Canada', 'Thailand', 'Denmark']
['Canada', 'Thailand', 'Denmark', 'Japan']
['Afghanistan', 'Canada', 'Thailand', 'Denmark']


You can also use negative numbers to indicate starting points relative to the last entry in the list:

In [6]:
print(country_list[-1])
print(country_list[-4:-1])
print(country_list[-4:])
print(country_list[:-1])

Japan
['Canada', 'Thailand', 'Denmark']
['Canada', 'Thailand', 'Denmark', 'Japan']
['Afghanistan', 'Canada', 'Thailand', 'Denmark']
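
One way to read negative indices is that index `-k` points at the same entry as index `len(list) - k`. The sketch below checks this on the five-item `country_list`, which is re-created here so the cell runs on its own. The variable `n` is a name chosen for this example.

```python
country_list = ["Afghanistan", "Canada", "Thailand", "Denmark", "Japan"]
n = len(country_list)  # 5

# -1 refers to the same entry as index n - 1 (the last item)
print(country_list[-1] == country_list[n - 1])  # True

# The same relationship holds for both ends of a slice
print(country_list[-4:-1] == country_list[n - 4:n - 1])  # True
```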


## Challenge 1: Slicing Lists

Using the list in the next cell:

1. What does `thing[start:stop]` do? What is the output?
2. Write three different ways to slice the list from 'elephant' to the end.

In [7]:
thing = [1, 3, 8, 'elephant', 'banana', 2]
start = 2
stop = 5

### List Methods

As we discussed in Part 1, objects can have methods associated with their data type that are accessed via dot notation (`object.method()`).

Methods are functions that operate specifically on a particular data type. Lists have their own methods, which perform operations specific to the structure of lists. One of the most commonly used is `append()`, which adds an item to the end of a list.

In [8]:
print(country_list)

['Afghanistan', 'Canada', 'Thailand', 'Denmark', 'Japan']


In [9]:
country_list.append('USA')

In [10]:
print(country_list)

['Afghanistan', 'Canada', 'Thailand', 'Denmark', 'Japan', 'USA']


Note that the `append()` method operates **in-place**: it modifies the object it is applied to. This is not always the case in Python - some methods return an object that must be stored in its own variable.
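
To make the contrast concrete, here is a minimal sketch comparing an in-place list method with a string method that returns a new object. The variable names (`countries`, `result`, `name`, `shouted`) are illustrative only.

```python
countries = ["Canada", "Japan"]

# append() modifies the list in place and returns None
result = countries.append("USA")
print(countries)  # ['Canada', 'Japan', 'USA']
print(result)     # None

# upper() leaves the original string untouched and returns a new string
name = "canada"
shouted = name.upper()
print(name)     # canada
print(shouted)  # CANADA
```

A common slip is writing `countries = countries.append("USA")`, which replaces the list with `None`.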

There are many other useful list methods. Use the [documentation](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) to investigate available methods for dealing with lists.

## Challenge 2: Appending to Lists

We've created a list called `thing` in the cell below.

1. Append the following values to the list, individually: `'apple'`, `8`, and `9`. Print the ensuing list out.
2. Make a new list called `thing2` consisting of the values `'apple'`, `8`, and `9`. Append `thing2` to `thing`. How does the output differ from the output from the previous part?
3. Look at the [documentation](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) for the list method `.extend()`. Is there a way to rewrite your answer to (2) to use extend? How does that compare to the outputs of the previous two parts?
4. What is one situation where you would use `append` and one where you would use `extend`?

**Hint**: *Iterable* in Python means an object with multiple values that can be iterated through (including lists, tuples, and even strings).

In [11]:
#1
thing = [1, 3, 8, 'elephant', 'banana', 2]

# YOUR CODE HERE

#2
thing = [1, 3, 8, 'elephant', 'banana', 2]
thing2 = []


#3
thing = [1, 3, 8, 'elephant', 'banana', 2]
thing2 = []


## Dictionaries: Key-Value Structures

Dictionaries are organized on the principle of key-value pairs. The **keys** can be used to access the **values**. They're most useful when you have unordered data organized in pairs. This occurs, for example, in specifying metadata (data describing other data).

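Keys must be unique and immutable (hashable), such as ints, floats, strings, or tuples. Since Python 3.7, dictionaries preserve the order in which keys were inserted, but values are looked up by key rather than by position. Values can be any data type.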

Dictionaries are specified in Python using curly braces, with colons separating keys and values. Let's take a look at an example dictionary.

In [12]:
example_dict = {
    "name": "Forough Farrokhzad",
    "year of birth": 1935,
    "year of death": 1967,
    "place of birth": "Iran",
    "language": "Persian"}

example_dict['year of birth']

1935

Like lists, dictionaries have their own methods. One of them is the `keys()` method:

In [14]:
print(example_dict.keys())

dict_keys(['name', 'year of birth', 'year of death', 'place of birth', 'language'])


Remember how we did type conversion? We can do the same thing here, and cast the dictionary keys to a list, which we can then iterate through:

In [17]:
list(example_dict.keys())

['name', 'year of birth', 'year of death', 'place of birth', 'language']
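
As a sketch of what iterating through the keys can look like, the loop below prints each key alongside its value. The dictionary is repeated here so the cell runs on its own.

```python
example_dict = {
    "name": "Forough Farrokhzad",
    "year of birth": 1935,
    "year of death": 1967,
    "place of birth": "Iran",
    "language": "Persian"}

# Loop over the list of keys and look up each value by its key
for key in list(example_dict.keys()):
    print(key, "->", example_dict[key])
```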

## Challenge 3: Creating a Dictionary

Create a dictionary `fruits` with the following lists. Use the names of the lists as the keys of the dictionary. Print the list of keys of the dictionary.

In [19]:
fruit = ['apple', 'orange', 'mango']
length = [3.2, 2.1, 3.1]
color = ['red', 'orange', 'yellow']

Dictionaries are useful for hierarchical storage of data (and can even be nested!). They are also often used to initialize data frames, a useful data structure for tabular data.
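
A minimal sketch of a nested dictionary, assuming we want to store several properties per fruit. The name `fruit_info` and its layout are hypothetical choices for this example.

```python
# Hypothetical nested dictionary: each fruit maps to its own dictionary of properties
fruit_info = {
    "apple": {"length": 3.2, "color": "red"},
    "mango": {"length": 3.1, "color": "yellow"},
}

# Chain keys to reach a value in the inner dictionary
print(fruit_info["mango"]["color"])  # yellow

# The outer keys are the fruit names
print(list(fruit_info.keys()))  # ['apple', 'mango']
```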

## Data Frames

A common data structure you've likely already encountered is tabular data. Think of an Excel sheet: each column corresponds to a different feature of each datapoint, while rows correspond to different samples.

In scientific programming, tabular data is often called a "data frame". In Python, there is a specialized library called `pandas` which provides tools to create and manipulate data frames.

We're going to explore `pandas` more closely in Part 3, but let's try creating a `pandas` `DataFrame` object right now. We'll do this by creating a dictionary:

In [20]:
fruit = ['apple', 'orange', 'mango', 'strawberry', 'salmonberry', 'thimbleberry']
size = [3, 2, 3, 1, 1, 1]
color = ['red', 'orange', 'orange', 'red', 'orange', 'red']

fruits = {
    'fruit': fruit,
    'size': size,
    'color': color}

Next, we import the `pandas` library and pass in the dictionary to the `pd.DataFrame()` function, storing the result in a variable called `df`.

In [21]:
import pandas as pd

df = pd.DataFrame(fruits)
df

| | fruit | size | color |
|---|---|---|---|
| 0 | apple | 3 | red |
| 1 | orange | 2 | orange |
| 2 | mango | 3 | orange |
| 3 | strawberry | 1 | red |
| 4 | salmonberry | 1 | orange |
| 5 | thimbleberry | 1 | red |


The keys became column names and the values became cells in the `DataFrame`. In addition, there is an *index* on the left that keeps track of the row.

Objects can also have **attributes**, or variables associated with the data type. We can get the number of rows and columns with `df.shape`, an attribute of the data frame that holds a `(rows, columns)` tuple. How many rows and columns does this data frame have?

In [23]:
df.shape

(6, 3)
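
Because `df.shape` is a tuple, it can be unpacked into two variables. The sketch below rebuilds the same fruit table so it runs on its own. The names `n_rows` and `n_cols` are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'mango', 'strawberry', 'salmonberry', 'thimbleberry'],
    'size': [3, 2, 3, 1, 1, 1],
    'color': ['red', 'orange', 'orange', 'red', 'orange', 'red']})

# Attributes are accessed without parentheses, unlike methods
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 6 3
```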

## Challenge 4: Initializing a DataFrame

The following code gives an error. Why does it have an error? What are some ways we could fix this?

In [31]:
fruit = ['apple', 'orange']
length = [3.2, 2.1, 3.1]
color = ['red', 'orange', 'yellow']

fruit_dict = {
    'fruit': fruit,
    'length': length,
    'color': color}

df_fruit = pd.DataFrame(fruit_dict)

ValueError: All arrays must be of the same length

### DataFrame Slicing and Methods

The act of obtaining a particular subset of a data frame is often referred to as **slicing**. This uses bracket notation to select part of the data. We can choose a single column by selecting the name of that column. The result is what `pandas` calls a `pd.Series` object.

In [32]:
df

| | fruit | size | color |
|---|---|---|---|
| 0 | apple | 3 | red |
| 1 | orange | 2 | orange |
| 2 | mango | 3 | orange |
| 3 | strawberry | 1 | red |
| 4 | salmonberry | 1 | orange |
| 5 | thimbleberry | 1 | red |


In [33]:
# Bracket notation to choose a column
df['fruit']

0           apple
1          orange
2           mango
3      strawberry
4     salmonberry
5    thimbleberry
Name: fruit, dtype: object

In [27]:
# Specify each dimension of the data frame separately using the loc indexer
df.loc[:, 'fruit'] # Colon selects all rows

0           apple
1          orange
2           mango
3      strawberry
4     salmonberry
5    thimbleberry
Name: fruit, dtype: object

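We can choose a row by passing its index label as the first entry to `loc`: `df.loc[index, :]`. Note that `loc` is used with square brackets rather than called with parentheses like a method.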

In [34]:
# Select the first row
df.loc[0, :]

fruit    apple
size         3
color      red
Name: 0, dtype: object

In [35]:
# Select a single cell
df.loc[0, 'fruit']

'apple'
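
`loc` also accepts a slice of row labels and a list of column names. One difference from list slicing is that a label slice with `loc` includes the stop label. The sketch below rebuilds the fruit table so it runs on its own. The name `subset` is chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'mango', 'strawberry', 'salmonberry', 'thimbleberry'],
    'size': [3, 2, 3, 1, 1, 1],
    'color': ['red', 'orange', 'orange', 'red', 'orange', 'red']})

# Rows labeled 0 through 2 (inclusive) and two of the three columns
subset = df.loc[0:2, ['fruit', 'color']]
print(subset)
print(subset.shape)  # (3, 2)
```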

`DataFrame`s also have methods, including those for [merging](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge), [aggregation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), [handling nulls](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html), and others. For example, we can identify the number of unique values in each column by using `nunique()`:

In [36]:
df.nunique()

fruit    6
size     3
color    2
dtype: int64

We can also count how many times each unique value appears in a column using `df.value_counts()`:

In [37]:
df.value_counts(['color'])

color 
orange    3
red       3
dtype: int64

There are many more methods and operations for `pandas` DataFrames. Check out our Data Wrangling with Python workshop for more on DataFrames (and Part 3 of this workshop!).

## Other Data Structures

There are many more data structures in Python that you may run across. A few include:

* **tuple**: Similar to a list, but values can't be changed. Tuples are **immutable**.
* **set**: An unordered collection, which can only contain **unique** values.
* **range**: An immutable sequence of evenly spaced integers, defined by a start, a stop, and a step.
* And many [more](https://docs.python.org/3/library/stdtypes.html#immutable-sequence-types)!
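
A brief sketch of the three structures listed above. The variable names and values are illustrative only.

```python
# tuple: indexed like a list, but its values cannot be changed
point = (3, 4)
print(point[0])  # 3
try:
    point[0] = 10
except TypeError as err:
    print("Cannot modify a tuple:", err)

# set: duplicate values are dropped when the set is created
colors = set(['red', 'orange', 'orange', 'red'])
print(len(colors))  # 2

# range: convert to a list to see the numbers it represents
evens = list(range(0, 10, 2))
print(evens)  # [0, 2, 4, 6, 8]
```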


We usually encounter these as the output of functions rather than writing them ourselves, but it's good to be aware of them.