# Intro to Python 3: Dictionaries and Dataframes

While organizing data into tuples and lists is fine for our use, Python provides several more sophisticated ways of labeling and archiving data. In this tutorial, we will cover other ways of storing data: **dictionaries** and **dataframes**.

Additionally, this tutorial will cover how to import and run **packages** in Python, and it provides a brief primer on using **Regular Expressions**.

## Dictionaries

Like tuples and lists, Python's dictionaries are data structures that hold a collection of items. Unlike them, **dictionaries** are organized as pairs of **keys** and **values**: each key is a label that points to exactly one value. Dictionaries are commonly used for counting tokens in a text document, but there are several other applications.

But why use a dictionary over a series of nested lists or tuples? Dictionaries let you look up a value directly by its key, and they come with a number of methods for working with keys and values. This is especially helpful when each string you care about has exactly one value attached to it. Conversely, if you're concerned about the vectors and matrices that constitute the word "the" in a text document, dictionaries are a bad bet.

Below, I'll create a dictionary using curly braces (squiggly brackets).

In [1]:
#dict#
flavors = {"blueberry": 2.99, "cosmic key lime": 3.39, "orange": 3.29, "wild berry": 2.00}


Apart from the less inspiring names, this flavors dictionary should look eerily similar to the lists from previous exercises. If you'll recall, we needed to use the **min** and **max** functions in conjunction with lambda to find the cheapest and the most expensive flavor. Below we'll compare how finding the cheapest toaster pastry flavor works with a dictionary and with a list of tuples.

In [2]:
#nested_tuples#
flavors_l = [("blueberry", 2.99), ("cosmic key lime", 3.39), ("orange", 3.29), ("wild berry", 2.00)]

In [3]:
#using dict#
min(flavors.keys(), key=(lambda k: flavors[k]))

'wild berry'

In [4]:
#using tuple#
min(flavors_l, key = lambda entry: entry[1])[0]


'wild berry'

A few differences are worth noting. With the list of tuples, we use brackets to say which part of each entry in flavors_l to compare (entry[1], the price). With the dictionary, each entry is already a pair of a key and a value, so we compare the value stored under each key (flavors[k]). Additionally, since the min function returns a whole tuple from flavors_l, we need to add the [0] at the end of the command to return "wild berry."

While these differences may seem negligible, a dictionary's strengths really shine when we know the keyword we would like a value for. Compare the two processes below for finding out how much a blueberry toaster pastry costs.

In [5]:
#using dict#
flavors["blueberry"]

2.99

In [6]:
#using tuples#
for entry in flavors_l:
    # == matches the exact flavor name; "in" would also match substrings
    if entry[0] == "blueberry":
        print(entry[1])


2.99


Dictionaries are slightly more intuitive to use when we know the key we want information on. They are also faster: a list has to be searched entry by entry, so if "blueberry" were the last tuple in a very large list, the loop would have to pass over every other entry first, whereas a dictionary goes straight to the value stored under a key. Indeed, it's important to imagine applications where you're dealing with thousands of words that are featured multiple times; looping would be a bit like finding a needle in a haystack straw by straw.
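To make the token-counting use case concrete, here is a minimal sketch of tallying words with a dictionary. The sample sentence and the names `sample_text` and `token_counts` are made up for illustration; they are not part of the tutorial's data.

```python
# Hypothetical sample text, invented for this illustration.
sample_text = "the orange pastry and the blueberry pastry and the wild berry pastry"

token_counts = {}
for token in sample_text.split():
    # get() returns the current count, or 0 if the token has not been seen yet.
    token_counts[token] = token_counts.get(token, 0) + 1

print(token_counts["the"])      # direct lookup by key, no loop needed
print("lemon" in token_counts)  # membership test against the keys
```

Once the counts are built, asking how often a word appears is a single lookup, no matter how many distinct words the dictionary holds.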

## Regular Expressions

**Regular Expressions** are patterns used to find specific strings of text within larger strings. For instance, imagine you were presented with data in the following way: "blueberry toaster pastries sell for two dollars and ninety-nine cents," "orange toaster pastries sell for three dollars and twenty-nine cents," "ice cream sundae toaster pastries sell for four dollars." If your job were to turn this data into a dictionary that listed flavor and price, you would need to identify and convert textual patterns.

Let's start with flavor. Looking at the data below, we can tell that the words that appear before the string "toaster" contain the flavor information. The first thing we want to do is break these strings into individual tokens. To do this, we're going to first import the re module in Python and then create a word pattern.


In [7]:
textual_data = ["Blueberry toaster pastries sell for two dollars and ninety-nine cents.", "Orange toaster pastries sell for three dollars and twenty-nine cents.", "Ice cream sundae toaster pastries sell for four dollars."]

In [8]:
import re

# A raw string (r"...") keeps Python from treating the backslashes as string escapes.
word_pattern = re.compile(r"\w[\w\-']*\w")

re is the module used for regular expressions. In Python, you can import packages using the command **import**. If you only need a specific part of a package, you can import just that part, for example **from re import compile**.

Packages are collections of predefined functions that are not loaded automatically when Python starts. Some, like re, ship with Python; others, like Pandas, must be installed separately. It is often preferable to use the same packages as others rather than write redundant functions; online support communities, optimization, and ease of use should not be taken for granted!
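As a small sketch of the two import styles, the block below builds the same kind of pattern both ways. The alias `compile_pattern` is my own choice, used here because importing the function under its bare name would shadow Python's built-in `compile`.

```python
import re                                  # whole module: call re.compile(...)
from re import compile as compile_pattern  # one function, under a hypothetical alias

pattern_a = re.compile(r"\w+")
pattern_b = compile_pattern(r"\w+")

# Both patterns behave identically; only the way we reached the function differs.
print(pattern_a.findall("wild berry"))
print(pattern_b.findall("wild berry"))
```

Importing the whole module keeps it obvious where each function comes from, which is why the rest of this tutorial sticks with `import re`.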

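Now that we've imported re, we might ask what this word_pattern means. \w means any word character (a letter, a digit, or an underscore). The brackets define a set of allowed characters: a word character, a hyphen, or an apostrophe. The * means that this set may repeat zero or more times, so the match keeps going for as long as it finds word characters, hyphens, or apostrophes. The final \w requires the match to end on a word character, which is why trailing punctuation is left out. Note that the pattern needs at least two characters to match, so one-letter words such as "A" or "I" are skipped. https://regex101.com provides a helpful interface for learning more about and testing different RegEx patterns.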
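A quick way to check a reading of the pattern is to run it on a short sentence. The sentence below is invented for this sketch, and the pattern is written as a raw string so that the backslashes reach the regex engine unchanged.

```python
import re

word_pattern = re.compile(r"\w[\w\-']*\w")

# Hypothetical test sentence with a one-letter word, a hyphen, an apostrophe, and a period.
sample = "A well-known baker's dozen."
print(word_pattern.findall(sample))
# ['well-known', "baker's", 'dozen']
```

The hyphenated and possessive words survive intact, the period is dropped, and the one-letter word "A" is skipped because the pattern needs a first and a last word character.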

For this instance, word_pattern is similar to the **split()** command used earlier in that it will allow us to turn sentences into tokens. There is, however, a notable difference.

In [9]:
#with regex#
for line in textual_data:
    print(word_pattern.findall(line))

['Blueberry', 'toaster', 'pastries', 'sell', 'for', 'two', 'dollars', 'and', 'ninety-nine', 'cents']
['Orange', 'toaster', 'pastries', 'sell', 'for', 'three', 'dollars', 'and', 'twenty-nine', 'cents']
['Ice', 'cream', 'sundae', 'toaster', 'pastries', 'sell', 'for', 'four', 'dollars']


In [10]:
#with split#
for line in textual_data:
    print(line.split())

['Blueberry', 'toaster', 'pastries', 'sell', 'for', 'two', 'dollars', 'and', 'ninety-nine', 'cents.']
['Orange', 'toaster', 'pastries', 'sell', 'for', 'three', 'dollars', 'and', 'twenty-nine', 'cents.']
['Ice', 'cream', 'sundae', 'toaster', 'pastries', 'sell', 'for', 'four', 'dollars.']


If you look at "cents", you'll see the split() method retained punctuation. In sentences that feature far more punctuation, this can compromise our ability to match words ("cents" does not equal "cents.").
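The mismatch is easy to check directly. The sketch below also shows one possible workaround that stays with split(): stripping punctuation from both ends of each token using the standard `string.punctuation` constant. This is an alternative I am adding for comparison, not the approach the tutorial goes on to use.

```python
import string

sentence = "Orange toaster pastries sell for three dollars and twenty-nine cents."

raw_tokens = sentence.split()
print("cents" in raw_tokens)    # False: the last token is "cents."

# strip() only removes characters at the ends, so "twenty-nine" keeps its hyphen.
clean_tokens = [token.strip(string.punctuation) for token in raw_tokens]
print("cents" in clean_tokens)  # True
```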

We can expand our regex method from above to extract flavor information.

In [11]:
for line in textual_data:
    # A separate name keeps the flavors dictionary from earlier intact.
    flavor_tokens = []
    line_w_tokens = word_pattern.findall(line)
    for token in line_w_tokens:
        if token != "toaster":
            flavor_tokens.append(token)
        else:
            break
    print(" ".join(flavor_tokens).lower())

blueberry
orange
ice cream sundae


In [12]:
string2num_dict = {"zero":0, "one":1, "two":2, "three":3, "four":4,"five":5,"six":6,"seven":7,"eight":8, "nine":9, "ten":10, "eleven":11,
                  "twelve":12, "thirteen":13, "fourteen":14, "fifteen":15, "sixteen":16, "seventeen":17, "eighteen":18, "nineteen":19,
                  "twenty":20,"thirty":30, "forty":40, "fifty":50, "sixty":60,"seventy":70, "eighty": 80, "ninety":90}

In [13]:
def dollars_and_cents(token):
    if "-" in token:
        separated = token.split("-")
        tens = string2num_dict[separated[0]]
        return(tens + string2num_dict[separated[1]])
    else:
        return(string2num_dict[token])

In [14]:
#test#
dollars_and_cents("twenty-one")

21

In the function above, words that have a hyphen in them are split. Since the dictionary would be very long if we included entries like {"twenty-one":21}, splitting the word in two allows us to be more efficient. Once both parts of the string have been given numerical values, the two are added back together. If we include this function in the for-loop from earlier, we should be able to get all of the data we need.
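One thing the function above does not handle is a word that is missing from the dictionary, which raises a KeyError. A minimal sketch of a more forgiving variant follows; the name `words_to_number` and the shortened lookup table `small_numbers` are hypothetical and exist only for this illustration.

```python
# Shortened, hypothetical lookup table; the tutorial's string2num_dict is larger.
small_numbers = {"one": 1, "nine": 9, "twenty": 20, "ninety": 90}

def words_to_number(token):
    """Return the number for a word like 'ninety-nine', or None if any part is unknown."""
    total = 0
    for part in token.split("-"):
        value = small_numbers.get(part)  # get() returns None instead of raising KeyError
        if value is None:
            return None
        total += value
    return total

print(words_to_number("ninety-nine"))  # 99
print(words_to_number("twenty-one"))   # 21
print(words_to_number("a dozen"))      # None
```

Returning None leaves it to the caller to decide what an unreadable number should mean, rather than stopping the whole loop on the first surprise.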

In [15]:
def itemizer(line):
    line_w_tokens = word_pattern.findall(line)
    cost = 0
    for position in range(len(line_w_tokens)):
        if line_w_tokens[position] == "toaster":
            flavor = " ".join(line_w_tokens[:position])
        elif line_w_tokens[position] == "dollars":
            dol_amt = dollars_and_cents(line_w_tokens[position -1])
            cost += float(dol_amt)
        elif line_w_tokens[position] == "cents":
            cen_amt = dollars_and_cents(line_w_tokens[position -1])
            cost += cen_amt/100
        else:
            pass
    print((flavor.lower(), cost))

In [16]:
for line in textual_data:
    itemizer(line)

('blueberry', 2.99)
('orange', 3.29)
('ice cream sundae', 4.0)
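Since the goal stated at the start of this section was a dictionary of flavors and prices, one way to finish the job is to have the function return its result instead of printing it. The sketch below is self-contained, so it repeats a trimmed-down number table; `parse_line`, `to_number`, `NUMBERS`, and `price_by_flavor` are names I made up for this example.

```python
import re

word_pattern = re.compile(r"\w[\w\-']*\w")

# Trimmed, hypothetical number table: just enough for the three sentences below.
NUMBERS = {"two": 2, "three": 3, "four": 4, "nine": 9, "twenty": 20, "ninety": 90}

def to_number(token):
    # "ninety-nine" -> 90 + 9; "four" -> 4
    return sum(NUMBERS[part] for part in token.split("-"))

def parse_line(line):
    """Return (flavor, price) for one sentence, mirroring the itemizer logic."""
    tokens = word_pattern.findall(line)
    flavor, cents = "", 0
    for position, token in enumerate(tokens):
        if token == "toaster":
            flavor = " ".join(tokens[:position]).lower()
        elif token == "dollars":
            cents += 100 * to_number(tokens[position - 1])
        elif token == "cents":
            cents += to_number(tokens[position - 1])
    return flavor, cents / 100

textual_data = [
    "Blueberry toaster pastries sell for two dollars and ninety-nine cents.",
    "Orange toaster pastries sell for three dollars and twenty-nine cents.",
    "Ice cream sundae toaster pastries sell for four dollars.",
]

# dict() accepts any sequence of (key, value) pairs.
price_by_flavor = dict(parse_line(line) for line in textual_data)
print(price_by_flavor)
```

Keeping the running total in whole cents and dividing once at the end is a design choice in this sketch; it avoids adding several floating-point fractions together.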


## Dataframes

Imagine that you work for a coffee shop and receive the following e-mails from your boss:

"A new shipment of blueberry toaster pastries will be arriving on Friday. Each shipment contains twelve packages. We currently have four packages in stock."

"The orange toaster pastries will be here by Thursday. We ordered two shipments containing six packages each. We currently have six packages in stock."

"I cancelled the shipment for ice cream sundae toaster pastries because they have not sold very well. We still have twelve packages in stock. On Thursday, we will reduce their price from four dollars to three dollars."

Using a dictionary to store this data would not make much sense; there are now many numbers to keep track of for each toaster pastry. While using a series of tuples might make sense, they lack the legibility that would make them helpful for another employee who was also hoping to keep track of inventories.

Let's begin, however, by modifying our existing code to see how much information we can glean from these strings. The list below leaves out the price change mentioned in the third e-mail.

In [17]:
shipment_data = []

e_mails = ["A new shipment of blueberry toaster pastries will be arriving on Friday. Each shipment contains twelve packages. We currently have four packages in stock.",
          "The orange toaster pastries will be here by Thursday. We ordered two shipments containing six packages each. We currently have six packages in stock.", 
           "I cancelled the shipment for ice cream sundae toaster pastries because they have not sold very well. We still have twelve packages in stock."]


In [18]:
for line in e_mails:
    itemizer(line)

('new shipment of blueberry', 0)
('the orange', 0)
('cancelled the shipment for ice cream sundae', 0)


So using the existing code as is, we were still able to get some data about the relevant flavors. Still, this hasn't gotten us half the data that we want. Ideally, our code will tell us the following:

1.) What flavor?
2.) How many are in stock?
3.) When will a new shipment arrive?
4.) How many will be in stock after the shipment?

If we think of these questions as a spreadsheet, we would want five columns:
flavor, stock, shipment, stock_after_shipment, price

Before we build a function, we should identify the tokens in the strings above that are important in relation to these questions.

1.) Flavor: the tokens that occur before the word **toaster** and after the nearest preposition or article.
2.) Stock: the token before **packages in stock**.
3.) Shipment day: the token will be a day of the week. We can probably make a dictionary for this.
4.) Shipment size: the number that follows **shipment** or **shipments** + **contains** or **containing**. We then need to add this to existing stock.



In [19]:
#for toaster#
#common preps from
common_prepositions_and_articles = ["of", "in", "to", "for", "with",
                                    "on","at","from","by","about","as",
                                   "into", "like","through","after", "over",
                                   "between", "out", "against", "during","without",
                                   "before", "under","around", "among", "a",
                                   "an", "the"]
def find_flavor(line):
    line_w_tokens = word_pattern.findall(line)
    flavors = []
    start = 0
    for token in range(len(line_w_tokens)):
        if line_w_tokens[token] != "toaster":
            flavors.append(token)
        else:
            break       
    for entry in reversed(flavors):
        if line_w_tokens[entry].lower() in common_prepositions_and_articles:
            start = entry + 1
            break
    return(" ".join(line_w_tokens[start:len(flavors)]))

In [20]:
for line in e_mails:
    print(find_flavor(line))

blueberry
orange
ice cream sundae

The code above modifies our original function to find flavors quite significantly. The largest addition is the list of common prepositions and articles. While the earlier sentences had the names of the flavors at their beginning, letting us say that the flavor was equal to everything until the token "toaster", these sentences all begin differently: *A new shipment of blueberry*, *The orange*, and *I cancelled the shipment for ice cream sundae*. By adding a list of articles and prepositions, we are able to say where the flavor description begins: *A new shipment **of** blueberry*, ***The** orange*, and *I cancelled the shipment **for** ice cream sundae*.

If you think of this as a sheet of paper we need to cut twice, our first direction tells our hands with scissors (or scissorhands) to move from left to right along the sheet until we come to the token toaster. This is how **iteration** typically works. Once we've made that cut, the **reversed** function allows us to move right to left until we come to a preposition or article. Once we know where that point is, we are able to, in our final command, say that we only want the tokens between those two incisions.
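The two cuts can also be sketched more compactly with list.index() and a reversed range. This is an illustrative rewrite, not a replacement for the function above; `flavor_between_cuts` and the short stop-word set are hypothetical, and the tokens are typed in by hand rather than produced by word_pattern.

```python
def flavor_between_cuts(tokens, stop_words):
    # First cut: scan left to right for "toaster" (raises ValueError if it is absent).
    end = tokens.index("toaster")
    # Second cut: walk right to left from the token just before "toaster".
    start = 0
    for position in reversed(range(end)):
        if tokens[position].lower() in stop_words:
            start = position + 1
            break
    # Keep only the tokens between the two cuts.
    return " ".join(tokens[start:end])

stop_words = {"the", "for", "of"}  # trimmed, hypothetical list
tokens = ["cancelled", "the", "shipment", "for", "ice", "cream", "sundae", "toaster", "pastries"]
print(flavor_between_cuts(tokens, stop_words))           # ice cream sundae
print(flavor_between_cuts(["The", "orange", "toaster"], stop_words))  # orange
```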

In [21]:
#for in_stock#
def find_in_stock(line):
    line_w_tokens = word_pattern.findall(line)
    for position in range(len(line_w_tokens)):
        if line_w_tokens[position] == "stock":
            return(string2num_dict[line_w_tokens[position-3]])

In [22]:
for line in e_mails:
    print(find_in_stock(line))

4
6
12

This code is much simpler than the last! It looks for the token "stock" and then chooses the token that's three positions before that (skipping "packages" and "in"). It then takes that token and turns it into a number using the dictionary from earlier.

While this code works for this example, what issues could you see it running into if our dataset were larger and more varied than three sentences? What are the drawbacks of designing code around specific rules and keywords?
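As one illustration of how such a rule can break, consider a sentence that says "twelve in stock" without the word "packages": the token three positions before "stock" is then "have", which is not a number word. The sketch below checks for the full phrase and returns None when it is missing. The names `stock_or_none` and `NUMBER_WORDS` are hypothetical, and the sentences are tokenized with split() only to keep the example short.

```python
NUMBER_WORDS = {"four": 4, "six": 6, "twelve": 12}  # trimmed, hypothetical table

def stock_or_none(tokens):
    """Return the count before 'packages in stock', or None if the phrase is absent."""
    for position in range(3, len(tokens)):
        # Require the whole three-word phrase, not just the final word.
        if tokens[position - 2:position + 1] == ["packages", "in", "stock"]:
            return NUMBER_WORDS.get(tokens[position - 3])
    return None

print(stock_or_none("We still have twelve packages in stock".split()))  # 12
print(stock_or_none("We still have twelve in stock".split()))           # None
```

This only covers one variation, of course; every new phrasing would need another rule, which is the drawback the questions above are pointing at.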

In [23]:
#for day_of_week#
d_o_w = {"Monday":1, "Tuesday":2, "Wednesday":3, "Thursday":4, "Friday":5, "Saturday":6, "Sunday":7}

def find_dow (line):
    line_w_tokens = word_pattern.findall(line)
    dow = 0
    for key in d_o_w:
        if key in line_w_tokens:
            dow += d_o_w[key]
        else:
            pass
    return(dow)

In [24]:
for line in e_mails:
    print(find_dow(line))

5
4
0

This code uses a dictionary to search the tokens for days of the week. Rather than return the matching day of the week, this code adds the dictionary value to a variable called dow. The reason for this is that we know one of the sentences involves a cancellation and so mentions no day. Rather than returning no date, it's better to insert a 0; Python is sometimes finicky about blank entries or None types. One caveat: if a sentence mentioned two different days, their values would be added together.

In [25]:
#for stock#

In [26]:
def find_amt (line):
    line_w_tokens = word_pattern.findall(line)
    multiplier = 1
    quantity = 0
    for position in range(len(line_w_tokens)):
        if line_w_tokens[position] == "shipments":
            no_shipments = string2num_dict[line_w_tokens[position-1]]
            multiplier *= no_shipments
        if "shipment" in line_w_tokens[position]:
            if "contain" in line_w_tokens[position + 1]:
                quantity += string2num_dict[line_w_tokens[position + 2]]
    return(quantity * multiplier)
 

In [27]:
for line in e_mails:
    print(find_amt(line))

12
12
0

This last bit of code is probably the most complex so far. The multiplier and quantity variables allow us to determine what the final amount will be. When the word "shipments" appears, we can assume that the multiplier will be necessary. If we see the word "shipment" followed by "contain" (which can be contains or containing), we have reason to believe that the next word will be the quantity per shipment. The code ends with the quantity being multiplied by the multiplier.

Now that we have these functions together, we can start assembling our data. A good intermediate step might be to see if there's any redundancy in the code we can make universal. For instance, the first line of each of our functions uses a Regular Expression to break the sentence down into tokens. Rather than do that four times, it would make more sense to do it once. One of the exercises below will ask you to do this.

For now, let's turn our sentences into tuples!

In [28]:
inventory_data = []

for line in e_mails:
    inventory_data.append((find_flavor(line), find_in_stock(line), find_dow(line), find_amt(line)))

In [29]:
inventory_data

[('blueberry', 4, 5, 12), ('orange', 6, 4, 12), ('ice cream sundae', 12, 0, 0)]

So beautiful, and yet the presentation leaves so much to be desired. Using Pandas, we're going to turn this into a dataframe.

In [35]:
import pandas as pd

dataframe = pd.DataFrame(inventory_data, columns = ["flavor", "stock", "shipment", "quantity in shipment"])

In [36]:
dataframe

|   | flavor | stock | shipment | quantity in shipment |
|---|---|---|---|---|
| 0 | blueberry | 4 | 5 | 12 |
| 1 | orange | 6 | 4 | 12 |
| 2 | ice cream sundae | 12 | 0 | 0 |
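One of our original questions, how many packages will be in stock after the shipment, is not yet answered by any column. With a dataframe, this can be sketched as simple column arithmetic. The block below rebuilds the dataframe from the tuples shown earlier so that it runs on its own; the column name stock_after_shipment is taken from the wish list above.

```python
import pandas as pd

inventory_data = [("blueberry", 4, 5, 12), ("orange", 6, 4, 12), ("ice cream sundae", 12, 0, 0)]
dataframe = pd.DataFrame(inventory_data,
                         columns=["flavor", "stock", "shipment", "quantity in shipment"])

# Adding two columns works row by row and produces a new column.
dataframe["stock_after_shipment"] = dataframe["stock"] + dataframe["quantity in shipment"]

print(dataframe[["flavor", "stock_after_shipment"]])
```

Doing the same with the list of tuples would take a loop and a new tuple for every row, which is part of what makes dataframes convenient for this kind of bookkeeping.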
