# Python Data Structures: Dictionaries and Data Frames

**Learning Objectives**
* Introduce dictionaries and data frames.
* Practice interacting with and manipulating these data structures.

Dictionaries and data frames are two more essential data structures. We will cover each of them in the sections below.


## Dictionaries: Key-Value Structures

Dictionaries are organized on the principle of key-value pairs. The **keys** can be used to access the **values**. They're most useful when your data comes in labeled pairs and you want to look up each value by its label rather than by its position. This occurs, for example, when storing metadata (data describing other data).

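Keys must be immutable (hashable) objects, such as ints, floats, strings, or tuples, and each key in a dictionary must be unique. Values are looked up by key rather than by position, although since Python 3.7 dictionaries do remember the order in which keys were inserted. Values can be any data type.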
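As a minimal sketch of which objects Python accepts as keys, the cell below builds a made-up dictionary called `mixed_keys` (a name chosen only for this illustration) with several immutable key types, then shows that a list is rejected as a key because it is mutable:

```python
# Hypothetical dictionary for illustration: keys of several immutable types
mixed_keys = {
    1: "an int key",
    2.5: "a float key",
    "three": "a string key",
    (4, 5): "a tuple key",
}

# Look up a value using a tuple key
print(mixed_keys[(4, 5)])

# Values can be any type, including lists
mixed_keys["a list value"] = [1, 2, 3]

# A list cannot be used as a key because lists are mutable (unhashable)
try:
    mixed_keys[[6, 7]] = "a list key"
except TypeError as error:
    print("TypeError:", error)
```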

Dictionaries are specified in Python using curly braces, with colons separating keys and values. 

Let's take a look at an example dictionary:

In [None]:
example_dict = {
    "name": "Forough Farrokhzad",
    "year of birth": 1935,
    "year of death": 1967,
    "place of birth": "Iran",
    "language": "Persian"}

example_dict['year of birth']

Like lists, dictionaries have their own methods. One of them is the `keys()` method. What is the type of the output of `.keys()`?

In [None]:
print(example_dict.keys())
type(example_dict.keys())

`dict_keys` is a type that we haven't encountered before. It can be looped over, but it does not support indexing (for example, `example_dict.keys()[0]` raises an error), so it can be awkward to work with directly. However, recall that we can use type conversion to change the type of a variable. We can **cast** (or change the type of) the dictionary keys to a list, which is a type that we are more familiar with:

In [None]:
list(example_dict.keys())
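Dictionaries have other methods besides `keys()`. The sketch below repeats the example dictionary so that it runs on its own, then tries `.values()`, `.items()`, the `in` operator, and `.get()`. The key `"occupation"` is deliberately one that is not in the dictionary, used only to show what happens with a missing key:

```python
# Same example dictionary as above, repeated so this cell runs on its own
example_dict = {
    "name": "Forough Farrokhzad",
    "year of birth": 1935,
    "year of death": 1967,
    "place of birth": "Iran",
    "language": "Persian"}

# .values() returns the values; cast to a list to make them easier to read
print(list(example_dict.values()))

# .items() returns (key, value) pairs, which is handy in a for loop
for key, value in example_dict.items():
    print(key, "->", value)

# The `in` operator checks whether a key exists
print("language" in example_dict)

# .get() returns a default value instead of raising a KeyError
# when the key is missing ("occupation" is not in the dictionary)
print(example_dict.get("occupation", "unknown"))
```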

## Challenge 1: Creating a Dictionary

Create a dictionary `fruits` using the following lists as *values*. Choose an appropriate string for each of the *keys* of the dictionary. Print the keys in the dictionary.

In [None]:
fruit = ['apple', 'orange', 'mango']
length = [3.2, 2.1, 3.1]
color = ['red', 'orange', 'yellow']



# YOUR CODE HERE

Dictionaries are useful for hierarchical storage of data (and can even be nested just like lists!). They are also often used to initialize data frames, a useful data structure for tabular data, and essential for data scientists.
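To illustrate nesting, here is a small sketch that regroups the earlier example into a dictionary whose values are themselves dictionaries. The name `record` and the grouping keys `"years"` and `"origin"` are made up for this illustration:

```python
# Hypothetical nested dictionary: some values are themselves dictionaries
record = {
    "name": "Forough Farrokhzad",
    "years": {"birth": 1935, "death": 1967},
    "origin": {"place of birth": "Iran", "language": "Persian"},
}

# Chain square brackets to reach an inner value
print(record["years"]["birth"])

# Inner dictionaries can be updated just like any other dictionary
record["years"]["lifespan"] = record["years"]["death"] - record["years"]["birth"]
print(record["years"])
```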

## Data Frames

A common data structure you've likely already encountered is tabular data. Think of an Excel sheet: each column corresponds to a different feature of each datapoint, while rows correspond to different samples.

In scientific programming, tabular data is often called a "data frame". In Python, there is a specialized library called `pandas` which contains an object, `DataFrame`, that implements this data structure.

We're going to explore `pandas` more closely in Part 3, but let's try creating a `DataFrame` object right now. 


First, we need to create a dictionary:

**Note:** You can also substitute in your answer for Challenge 1 below.

In [None]:
fruit = ['apple', 'orange', 'mango', 'strawberry', 'salmonberry', 'thimbleberry']
size = [3, 2, 3, 1, 1, 1]
color = ['red', 'orange', 'orange', 'red', 'orange', 'red']

fruits = {
    'fruit': fruit,
    'size': size,
    'color': color}

Next, we import the `pandas` **library** (we will cover libraries in more detail in Part 3) and pass the dictionary to the `pd.DataFrame()` function, storing the result in a variable called `df`.

In [None]:
import pandas as pd

df = pd.DataFrame(fruits)
df

The keys became column names and the values became cells in the `DataFrame`. In addition, there is an **index** on the left that keeps track of the row.

Objects can also have **attributes**, which are variables attached to the object. Unlike methods, attributes are accessed without parentheses. We can get the number of rows and columns with `df.shape`, an attribute of the data frame.

**Question:** How many rows and columns does this dataframe have? 

In [None]:
df.shape
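To see the difference between attributes and methods without giving away the answer above, here is a sketch on a separate, made-up data frame called `df_demo` (the name and its contents are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical two-row data frame, separate from `df` above
df_demo = pd.DataFrame({"letter": ["a", "b"], "number": [1, 2]})

# Attributes are accessed without parentheses
print(df_demo.shape)          # a tuple: (number of rows, number of columns)
print(list(df_demo.columns))  # the column names
print(df_demo.dtypes)         # the data type of each column

# Methods are called with parentheses
print(df_demo.head(1))        # the first row of the data frame
```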

## Challenge 2: Initializing a DataFrame

The following code raises an error. Why does the error occur? What are some ways to fix it?

In [None]:
fruit = ['apple', 'orange']
length = [3.2, 2.1, 3.1]
color = ['red', 'orange', 'yellow']

fruit_dict = {
    'fruit': fruit,
    'length': length,
    'color': color}

df_fruit = pd.DataFrame(fruit_dict)

## Working with DataFrames

Pandas has hundreds of useful ways for us to work with DataFrames. We will cover a couple of general topics here and in Part 3, but for more on pandas, consider the Python Data Wrangling workshop. 


We can choose a single column by putting its name inside square brackets after the data frame. The result is what `pandas` calls a `pd.Series` object. The act of obtaining a particular subset of a data frame is often referred to as **slicing**.

Check the type of the slice below:

In [None]:
# Bracket notation to choose a column
df['fruit']
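Bracket notation can select more than one column at a time, and it can also select rows. The sketch below rebuilds the fruit data frame so that it runs on its own; the variable names `subset` and `small` are made up for this illustration:

```python
import pandas as pd

# Rebuild the data frame from earlier so this cell runs on its own
df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'mango', 'strawberry', 'salmonberry', 'thimbleberry'],
    'size': [3, 2, 3, 1, 1, 1],
    'color': ['red', 'orange', 'orange', 'red', 'orange', 'red']})

# A list of column names inside the brackets returns a DataFrame
subset = df[['fruit', 'color']]
print(type(subset))

# A condition inside the brackets keeps only the rows where it is True
small = df[df['size'] == 1]
print(small)
```

Note the double brackets in `df[['fruit', 'color']]`: the outer pair is the bracket notation, and the inner pair is an ordinary Python list of column names.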

`DataFrame` objects also have methods, including those for [merging](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge), [aggregation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html), [nulls](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html), and others. Many methods also operate on a single column of the DataFrame. For example, we can count the unique values in a column with `.nunique()`, and see what those unique values are with `.unique()`:

In [None]:
# Number of unique colors in the df
print(df['color'].nunique())

# Unique colors in the df
print(df['color'].unique())

## Challenge 3: `value_counts()`

There is another pandas method, `.value_counts()`, which counts how many times each of the values reported by `unique()` appears. Read the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) and apply `value_counts()` to the appropriate column of `df`. How many 'red' and 'orange' fruits are in the DataFrame?

In [None]:
# YOUR CODE HERE