<a href="https://colab.research.google.com/github/amckenny/text_analytics_intro/blob/main/notebooks/02_basic_python_data_types_and_structures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2 - Basic Python Data Types and Structures
---

The goal of this notebook is to review the Python data types and structures you're likely to come across in text analysis. As with the previous module, the idea here is not that you'll have a full understanding of each of these. Rather, the goal is familiarity - you know enough about these that you'll be able to recognize them and how to do basic operations with them when we really get into the text analytics.

Learning outcomes for this module:
* Solid understanding of four foundational data types: 'str', 'int', 'float', 'bool'
* A slightly deeper understanding of string ('str') operations
* Basic understanding of the 'tuple', 'list', and 'dict' data structures
* Understanding of packages/modules and how to invoke them.
* Passing familiarity with the 'collections.Counter', 'numpy.ndarray', and 'pandas.DataFrame' data structures

## 2.1. Basic Data Types: Words and Numbers
---


If you completed Module 1, you've actually already seen all four of the basic data types:

1. You've seen **integers** ('int' in Python): which are numbers with no decimal points (e.g., 1, -27, 42)
2. You've seen **floats** ('float' in Python): which are numbers *with* decimal points (e.g., 1.0, -27.22, 42.4242)
3. You've seen **strings** ('str' in Python): which are a series of characters all strung together inside quotes (e.g., "Aaron", 'Spam', 'Hoosiers')
4. You've seen **booleans** ('bool' in Python): which are exclusively True and False.

Before we dig into each type of data, it's worth looking at how to get Python to tell you the type of something - whether you type it in manually or whether it's stored in a variable.

* `type(something)`

The 'type' function, shown above, returns the Python data type of whatever 'something' you pass to it. For example:

In [None]:
print(type("something"))

We sent the 'type' function a string (a series of characters), and it told us that it was a str (Python's way of denoting a string).

We could have just as easily stored the string in a variable and sent the 'type' function that variable:

In [None]:
what_am_i = "I'm a string"
print(type(what_am_i))

In the next code block, we'll see that type correctly identifies each of the different data types I identified above.

For clarity, here I use *f-strings* to print it out in a more readable format.

In [None]:
print(f"1 is a {type(1)}.")
print(f"1.0 is a {type(1.0)}.")
print(f"Spam is a {type('Spam')}")
print(f"True is a {type(True)}")

If you want Python to confirm whether something is a specific data type, rather than telling you what data type it is, you use:

`isinstance(something, data type)`

The logic behind this function name is that 'str', 'int' and so on are data types, and the things that contain that type are *instances* of that type. So this function basically answers the question, is this *something* an instance of type *data type*?

'isinstance' is a little different from the other functions that we've used thus far (e.g., 'print', 'type'), because here we have to provide it with two bits of information. When we provide functions with multiple pieces of data, we separate them with a comma.

Let's try it:

In [None]:
isinstance("Spam", str)

The 'True' response makes sense. "Spam" is a string, and we asked if it was an instance of the 'str' type.

Let's try it with a variable...

In [None]:
num_of_pages = 42
isinstance(num_of_pages, float)

Here we created a variable that holds a whole number of pages (decimals don't really make sense here). We then asked Python whether this variable contains a float (a number with a decimal).

What Python said made sense. While both int and float are numbers, they are different *types,* and that is what isinstance is looking for.
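
If the question is really "is this a number of either kind?", `isinstance` also accepts several types at once, grouped in parentheses as its second argument. A minimal sketch, reusing `num_of_pages` from above (the other variable names are just for illustration):

```python
num_of_pages = 42

# A single type: is this an int? Is it a float?
is_int = isinstance(num_of_pages, int)
is_float = isinstance(num_of_pages, float)

# Several types grouped in parentheses: is this an int OR a float?
is_number = isinstance(num_of_pages, (int, float))

print(f"int? {is_int} | float? {is_float} | int or float? {is_number}")
```

This is handy when you don't care which kind of number you have, only that it is one.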

Ok, with knowledge of how to probe the types of different bits of data, let's jump into our first two basic data types: Integers and Floats

### 2.1.1. Integers and Floats
---


Integers (int - whole numbers) and floats (float - numbers that have a decimal point) generally play well together. Most of the things you can do with one, you can do with the other, and for many purposes we can treat them as interchangeable.

Consider the following:


In [None]:
print(1 + 3.5)

In [None]:
print(3 ** 1.5)

In [None]:
print(3 == 3.0)

All of the above involve mixing integers with floats in an expression. The last piece of code even told us that the integer version of 3: 3 and the float version of 3: 3.0 are the same number, despite them being of a different 'type'.
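
To see what type comes out of a mixed expression, we can wrap the result in `type()`. This is only a small sketch (the variable names are made up for illustration), but it shows that mixing an int with a float gives a float, and that regular division gives a float even when both inputs are ints:

```python
mixed_sum = 1 + 3.5        # int + float
int_division = 6 / 3       # int / int, using the regular division operator

print(f"1 + 3.5 = {mixed_sum}, which is a {type(mixed_sum)}")
print(f"6 / 3 = {int_division}, which is a {type(int_division)}")

# Equal in value, but not the same type
same_value = 3 == 3.0
same_type = type(3) == type(3.0)
print(f"Same value? {same_value} | Same type? {same_type}")
```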

However, in some cases we really do need to keep our floats separate from our integers. We'll talk about some cases later in this module. But for now, let's just take it for granted that sometimes we want our 3 to be a float and for our 3.0 to be an integer.

For instance, let's say you have a variable 'page_count' containing the number of pages in the text:

In [None]:
page_count = 125.0
print(f"There are {page_count} pages in this text - the data type of 'page_count' is {type(page_count)}")

Is page_count being a float problematic in this case? Not really. But in some cases we'd really want (or need) for it to be an integer.

Fortunately, Python has functions that change a value from one type to another, and they're pretty easy to remember too:

* `int(variable)` - change variable to an int
* `float(variable)` - change variable to a float
* `str(variable)` - change variable to a str (string)
* `bool(variable)` - change variable to a bool (boolean)

For right now we'll just look at those first two:

In [None]:
page_count = int(page_count) # Overwrites what is in the 'page_count' variable with the integer version of what it currently contains
print(f"There are {page_count} pages in this text - the data type of 'page_count' is {type(page_count)}")

In [None]:
page_count = float(page_count) # Overwrites what is in the 'page_count' variable with the floating-point version of what it currently contains
print(f"There are {page_count} pages in this text - the data type of 'page_count' is {type(page_count)}")

That worked! These two pieces of code just swap whether page_count is an int or a float and verify that by printing it.

What happens though, when we try to change a float like 307.999 into an int?

In [None]:
avg_words_per_page = 307.999
avg_words_per_page = int(avg_words_per_page)
print(avg_words_per_page)

Well, it's an int now... but it appears to have just lopped off the decimal despite how close to the next integer it is. That's not necessarily the way we'd want it to work.

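Let's use the `round(number, decimals)` function to round to the nearest value instead of cutting the decimals off. (Exact .5 ties go to the nearest even number in Python, rather than always rounding up.)
* number = the number to be rounded
* decimals = the number of decimal places to keep (0 for rounding to the nearest whole number)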
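
Two details of `round()` are easy to miss, so here is a minimal sketch of them (variable names are just for illustration). First, leaving out the second argument returns an int, while supplying it returns a float. Second, values sitting exactly on .5 are rounded to the nearest even whole number.

```python
# With one argument, round() returns an int
rounded_int = round(307.999)

# With a second argument, round() returns a float, even when that argument is 0
rounded_float = round(307.999, 0)

print(f"round(307.999) gives {rounded_int} ({type(rounded_int)})")
print(f"round(307.999, 0) gives {rounded_float} ({type(rounded_float)})")

# Exact ties (.5) go to the nearest EVEN whole number, not always up
ties = [round(0.5), round(1.5), round(2.5), round(3.5)]
print(f"0.5, 1.5, 2.5, and 3.5 round to {ties}")
```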

In [None]:
avg_words_per_page = 307.999
print(f"Step 0: average words per page is {avg_words_per_page} and the data type is a {type(avg_words_per_page)}.")

avg_words_per_page = round(avg_words_per_page, 0) # Overwrites what is in avg_words_per_page with a rounded version of what it currently contains
print(f"Step 1: average words per page is {avg_words_per_page} and the data type is a {type(avg_words_per_page)}.")

avg_words_per_page = int(avg_words_per_page)
print(f"Step 2: average words per page is {avg_words_per_page} and the data type is a {type(avg_words_per_page)}.")

Better! The last foundational function we should talk about for numbers (both int and float) is absolute value.

`abs(number)`

This behaves exactly like absolute value does in math: it returns the non-negative version of the supplied number.

In [None]:
neg_float = -37.2
neg_int = -107
print(f"The absolute value of {neg_float} is {abs(neg_float)}")
print(f"The absolute value of {neg_int} is {abs(neg_int)}")

### 2.1.2. Booleans
---

We've actually covered most of what you need to know about booleans (bool) already, we just presented it in the context of boolean operations in the first module.

Unlike numbers and text, there are only two values bool objects can have: `True` and `False`

In [None]:
python_interesting = True
print(f"Aaron thinks Python is interesting: {python_interesting}. That variable is of type: {type(python_interesting)}")
print(f"The opposite of {python_interesting} is {not python_interesting}")

In the previous module, we also learned that even strings/numbers have a 'truthiness' to them, and that an item's truthiness can be found using the `bool()` function.

That function is the direct parallel to the `int()` and `float()` functions we looked at for integers and floats. It tells us what the value would be if it were a boolean!

In [None]:
bool(0)

In [None]:
bool(42)

In [None]:
int(True)

In [None]:
float(False)

Here you can see that 0 is False and 42 is True when represented as a boolean. 

You also see that True and False have their own representations when cast as an int (1/0) or float (1.0/0.0).
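
One practical consequence of True behaving like 1 and False like 0 is that adding booleans counts how many of them are True. A small sketch, assuming a made-up set of results for four texts:

```python
contains_spam = [True, False, True, True]  # hypothetical results for four texts

# Because True counts as 1 and False as 0, adding booleans counts the True values
num_with_spam = sum(contains_spam)
print(f"{num_with_spam} of {len(contains_spam)} texts mention spam.")

# The same idea with plain addition
plain_total = True + True + False
print(plain_total)
```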

That's it! That's really all you need to know about booleans for right now.

### 2.1.3. Strings
---

We would be remiss in a series about computer-aided text analysis if we did not cover strings. After all, this is *text* analysis and Python stores text in strings.

As we saw in the previous module, strings are easy to identify because they are offset by quotes, either single `'` or double `"`.

We also saw that if you want to use the same kind of quote inside the string, it has to have a backslash `\` before it.

In [None]:
company_name = 'Panucci\'s Pizza'
print(f"I am hungry for some {company_name}")

Notice how when I printed the string out, that backslash didn't get printed? That's good, right? We didn't really want that backslash there; it is just supposed to tell Python that the `'` is an apostrophe, not the end of the string.

In strings, the backslash is called an **escape character** and it basically tells the string to treat whatever comes immediately after it a bit differently than it usually would.

Take for instance:

In [None]:
# \' and \" add single/double quotes to the string
print("Shakespeare said, \"All the world is a stage.\"")

In [None]:
# \t adds a <tab> to the string
print("Let\tus\tspace\tthings\tout.")
print("Do\tthese\tline\tup\tnicely?")

In [None]:
# \n breaks up a string into multiple lines
print("Nobody: ... \nAaron: Let's do some text analyses!")

In [None]:
# \\ allows you to include a \ as part of the string itself
print("...but what if I actually want to have a \\ in the string?")

Play around with these. What happens if you remove the leading backslash?

In the previous sections, we used the data type name (e.g., 'int', 'bool') to convert from one data type to another. We can do the same thing with strings using:

`str()`

In [None]:
str(1)

In [None]:
str(-32.854)

In [None]:
str(True)

And *sometimes* we can go the other way around...

In [None]:
int("4")

In [None]:
float("35.7")

In [None]:
bool("Spam")

But we have to be careful... Python is limited in its ability to represent strings as numbers.

In [None]:
int("4.0")

In [None]:
int("four")

As humans, we know exactly what *we want* the output to be here, but Python says it's one step too many. There are a number of ways we can convert those to integers, but that's beyond the scope of what we're doing here.

For our purposes, just know that we have to be a bit careful changing strings to numbers.
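
For the curious, here is one possible way to be careful. It is only a sketch: `to_int_or_none` is a hypothetical helper written for this example, and it uses `def` and `try`/`except`, which we haven't covered. The idea is to go from string to float to int, and to catch the error Python raises when the string isn't numeric at all.

```python
# A string like "4.0" can take two steps: string -> float -> int
four = int(float("4.0"))
print(f"{four} is a {type(four)}")

# For strings that may not be numeric at all, try/except catches the ValueError
def to_int_or_none(text):
    """Hypothetical helper: return text as an int, or None if it can't be converted."""
    try:
        return int(float(text))
    except ValueError:
        return None

print(to_int_or_none("35.7"))   # drops the decimal, just like int(35.7)
print(to_int_or_none("four"))
```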

Before we dive deeper into new string functions I want to introduce one last way of writing a string.

We've seen that with the help of escape characters we can create multi-line strings that are a lot more like the documents we are likely to work with in our research:

In [None]:
page_1_abridged = "Title of my Awesome Paper\n\nABSTRACT\nThis is my abstract\n\nINTRODUCTION\nThis is the introduction to my paper..."
print(page_1_abridged)

It works, but it is a bit clunky to write. If you want to write a string across multiple lines and have Python treat each press of Enter as a new line (`\n`) in the string, you can use three quotes (`'''` or `"""`) as offsets instead of single quotes.

Consider:

In [None]:
page_1_abridged = """Title of my Awesome Paper

ABSTRACT
This is my abstract

INTRODUCTION
This is the introduction to my paper..."""
print(page_1_abridged)

Handy! And better yet, more legible.

#### 2.1.3.1. Inspecting Strings' Contents
---

When we type strings into variables ourselves, we know exactly what is in them. But in our research we're going to be getting LOTS of strings from lots of different sources. Chances are, we're not going to want to print them all to our screen and evaluate what's in them ourselves.

Fortunately, there are ways of summarizing the contents of what's in the string for us. Let's walk through some of them:

* `len(string)` *function* - tells us the length of the string (in characters)
* `in` *operator* - tells us whether the string on the left of `in` is contained within the string on the right
* `not in` *operator* - tells us whether the string on the left is NOT contained within the string on the right
* `islower()` *method* - tells us whether the letters in the string are all lower case
* `isspace()` *method* - tells us whether the string is comprised just of spaces, new lines, and tabs
* `isalpha()` *method* - tells us whether the string is comprised only of letters
* `isdigit()` *method* - tells us whether the string is comprised only of digits
* `isalnum()` *method* - tells us whether the string is comprised only of letters and digits.
* `count(substring)` *method* - counts the number of times a *substring* occurs in the original string

Before we see some examples of these, you'll observe that I made a big deal about these being functions, operators, or methods. That's worth discussing for a second because that changes how we call them.

* *Functions* (like `len()`) stand on their own and we pass them data inside their parentheses. For example `len("my string")`.
* *Methods* (like `islower()`) are like functions, but they belong to an object of some sort (in this case the string itself). Here, we generally put that object first and connect the method to it with a `.`. For example `"my string".islower()`.
* *Operators* (like `in`) stand on their own and either precede (for unary operators like `not`) or sit between (for binary operators like `in`) the data they will use. For example `"my" in "my string"`.

I know this sounds complicated, but when you see it in practice below, I think you'll get the hang of it.

In [None]:
mystery_string = "Life is like a box of chocolates. You never know what you're gonna get."

In [None]:
# Because 'len()' is a *function*, len sits on its own and the string is passed to it in parentheses 
num_chars = len(mystery_string)

print(f"There are {num_chars} characters in the mystery string.")

In [None]:
# Because 'in' is a *binary operator* we sit it between the two pieces of data it works with (just like +, -, ==, and other operators we've used)
contains_chocolate = "chocolate" in mystery_string

print(f"Does the mystery string contain chocolate? {contains_chocolate}")

In [None]:
omits_spam = "spam" not in mystery_string
print(f"Does the mystery string omit the word spam? {omits_spam}")

In [None]:
# Because 'islower()' is a *method*, we attach it to the string itself with a '.'
# Further, because it doesn't need any additional information other than the contents of the string, we don't need to pass any data in the parentheses.
all_lowercase = mystery_string.islower()

print(f"Is the mystery string all lowercase? {all_lowercase}")

In [None]:
# isalpha() is also a method, so we use it the same way:
all_letters = mystery_string.isalpha()

print(f"Is the mystery string comprised only of letters? {all_letters}")

In [None]:
# While the above have used strings stored in a variable, methods (like functions) can be applied directly to a string you type in as well.
all_digits = "8675309".isdigit()

print(f"Is Jenny's phone number all digits? {all_digits}")

In [None]:
# 'count()' is a method that needs two pieces of information: The original string is the 'owner' of the method
# and so we'll put that before the dot. However, we also need to tell it what 'substring' to count in the original string.
# We'll pass that in the parentheses.

#Count the number of lowercase 'e's in the mystery string
count_e = mystery_string.count('e')

print(f"There are {count_e} e's in the mystery string")

Play around with these. Remember, you're not going to permanently break anything here. Worst case scenario, you simply reload the Notebook from my GitHub repository and all your changes will be gone.
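
Two methods from the list above, `isspace()` and `isalnum()`, didn't get their own example, so here is a short sketch of both. It also shows that `count()` is case sensitive. The strings other than `mystery_string` are made up for illustration.

```python
mystery_string = "Life is like a box of chocolates. You never know what you're gonna get."

# isspace(): True only when every character is whitespace
only_whitespace = " \t\n".isspace()
sentence_is_whitespace = mystery_string.isspace()

# isalnum(): True only when every character is a letter or a digit
clean_id = "Text42".isalnum()
id_with_space = "Text 42".isalnum()   # the space is neither a letter nor a digit

# count() is case sensitive: 'l' and 'L' are counted separately
lower_l = mystery_string.count("l")
upper_l = mystery_string.count("L")

print(f"Only whitespace? {only_whitespace} | Whole sentence whitespace? {sentence_is_whitespace}")
print(f"'Text42' alphanumeric? {clean_id} | 'Text 42' alphanumeric? {id_with_space}")
print(f"Lowercase l's: {lower_l} | Uppercase L's: {upper_l}")
```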

#### 2.1.3.2. Changing Strings
---

Once we have some idea of what's inside the string, we might decide to make some changes to the string.

Before we actually see some methods for doing that... why would we do that? After all, these strings represent texts we collected and want to describe scientifically. If we're making changes to them, doesn't that invalidate our data?

Well... let's say we want to do something pretty simple: count the number of references to "ER" in a statement by a hospital...

In [None]:
hosp_stmt = """We are pleased to announce that our new er is now up and running 24x7. This 
new E.R. is going to be named for famous statistician and nurse Florence Nightingale. Come visit the new E
R at any time!"""
search_string = "ER"

refs_to_er = hosp_stmt.count(search_string)

print(f"The hospital statement reads: {hosp_stmt}.\n")
print(f"There were {refs_to_er} references to the {search_string} in the statement.")

That... doesn't seem right. I count three references to the ER in the original text:

1.   "...our new er is now..."
2.   "...new E.R. is going..."
3.   "...visit the new E[line break]R..."

So why is the count zero? Let's take a look at some simpler comparisons:

In [None]:
#Original == Searched
print("er" == "ER")
print("E.R." == "ER")
print("E\nR" == "ER")

The problem was that "ER" doesn't *exactly* match any of those. 

*   The original is lowercase, we were searching for uppercase
*   The original is punctuated by periods, we were searching without
*   The original had a line-break between the E and the R, we were searching for contiguous letters.

Certainly, we could have changed what we were searching for so that we didn't have to change the original text. And that might make sense for searching in one text as short as this one. But that requires us to know what is in the text ourselves and to find the differences manually. At that point, we might as well just code the text manually... we're doing it anyway!

Now imagine we're looking for tens/hundreds of different words in thousands of texts. And each one could be slightly different in a slightly different way each time. Consider:
* Happy

* happy

* HAPPY

* Hap
  
  py

* ha
  
  ppy

* HAPP

  Y

And so on... 

It would be obscenely difficult to identify all possible permutations of the way the original texts might have presented the words we're looking for. Accordingly, we take the most common problems with naturally occurring text and standardize them so that we eliminate some of that variability.

The assumption here is that our theory doesn't really care about line breaks, tabs, capitalization, and so on. That's not always the case, but it is enough of the time that it's worth knowing how to do these kinds of things.

OK, enough pontificating. Let's see how to do some of these things:

* `lower()` method - change all letters to lowercase
* `strip()` method - removes extra whitespace on either side of the string
* `replace(x, y)` method - replace all instances of x in the string with y
* `split(delimiter)` method - breaks a string into a list of multiple strings based on the delimiter

There are other versions of these such as `upper()`, `rstrip()`, `partition()`, and so on, but the above are the ones you're most likely to use.

Let's take a look at how these work.

In [None]:
crazy_text = "OoOoOoOh, YoU aRe SoOoOo FuNnY!"
print(f"A mildly invigorating troll wrote \"{crazy_text}\" in an online forum.\n")

cleaner_text = crazy_text.lower()
print(f"I fixed it for them: \"{cleaner_text}\"")

In [None]:
original = " Aaron F McKenny "
matching_row_in_spreadsheet = "Aaron F McKenny"
print(f"Do \"{original}\" and \"{matching_row_in_spreadsheet}\" match? {original == matching_row_in_spreadsheet}\n")

stripped_original = original.strip()
print(f"How about \"{stripped_original}\" and \"{matching_row_in_spreadsheet}\"? {stripped_original == matching_row_in_spreadsheet}")

In [None]:
abbr_text_speak = "L.O.L. O.M.G. I.G.G. T.T.Y.L."
search_string = "LOL"
print(f"Is \"{search_string}\" in \"{abbr_text_speak}\"? {search_string in abbr_text_speak}\n")

# Let's replace all periods with nothing (an empty string, not even a whitespace)
std_text_speak = abbr_text_speak.replace(".", "")
print(f"Is \"{search_string}\" in \"{std_text_speak}\"? {search_string in std_text_speak}\n")

In [None]:
three_sentences = "This is sentence one. This is sentence two. This is sentence three."
print(f"The three sentences are: {three_sentences}\n")

# Let's split that string up based on places where it finds a period followed by a space.
list_of_sentences = three_sentences.split(". ")
print(list_of_sentences)

In that last block of code, you'll notice that the list of sentences is delimited by commas and enclosed in square brackets. This is how Python identifies lists. We'll talk more about lists later.

For now, let's go back to our original problem:

In [None]:
hosp_stmt = """We are pleased to announce that our new er is now up and running 24x7. This 
new E.R. is going to be named for famous statistician and nurse Florence Nightingale. Come visit the new E
R at any time!"""
search_string = "ER"

refs_to_er = hosp_stmt.count(search_string)

print(f"The hospital statement reads: {hosp_stmt}.\n")
print(f"There were {refs_to_er} references to the {search_string} in the statement.")

Let's **preprocess** our text so that our counting works! We won't need all of the methods above; just a couple will do the trick.

In [None]:
print(f"START: \n{hosp_stmt}\n\n")

#Step 1. Make everything lowercase
hosp_stmt = hosp_stmt.lower()
print(f"Step 1: \n{hosp_stmt}\n\n")

#Step 2. Remove newlines
hosp_stmt = hosp_stmt.replace("\n", "")
print(f"Step 2: \n{hosp_stmt}\n\n")

#Step 3. Remove periods
hosp_stmt = hosp_stmt.replace(".", "")
print(f"Step 3: \n{hosp_stmt}\n\n\n")

#Ok, our search string is currently 'ER', but we know there are no uppercase characters in our text anymore... let's search for 'er' instead
search_string = "er"

refs_to_er = hosp_stmt.count(search_string)

print(f"The hospital statement reads: {hosp_stmt}.\n")
print(f"There were {refs_to_er} references to the {search_string} in the statement.")

Three references found! That's better!

This isn't exactly how we'll end up doing it in our real text analyses, but the idea is the same and we'll be seeing these methods again!
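
One reason this isn't the final approach: `count()` looks for the letters "er" anywhere, including inside longer words. The sketch below contrasts that with counting whole words only. `count_word` is a hypothetical helper and `sample` is a made-up sentence, both written just for this illustration; the preprocessing steps are the same ones used above, chained together on one line.

```python
def count_word(text, word):
    """Hypothetical helper: count whole-word matches of `word` in `text`."""
    # Same preprocessing as above, chained into one line
    cleaned = text.lower().replace("\n", "").replace(".", "")
    # split() with no delimiter breaks the text apart on any whitespace
    words = cleaned.split()
    return words.count(word.lower())

sample = "The E.R. never closes. Visit the ER whenever you need to."

# Substring counting also picks up the 'er' inside 'never' and 'whenever'
substring_hits = sample.lower().replace(".", "").count("er")
whole_word_hits = count_word(sample, "ER")

print(f"Substring count: {substring_hits}")
print(f"Whole-word count: {whole_word_hits}")
```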

## 2.2. Common Data Structures: Tuples, Lists, and Dictionaries
---

The basic data types we covered above are called 'primitives' because they are the basic building blocks from which other more complex data structures are built.

In this section, we're going to talk about three such structures:

* Tuples ('tuple' in Python) 
* Lists ('list' in Python)
* Dictionaries ('dict' in Python) 

There are a TON more data structures than this. But again, these are the ones most important to understand now.

### 2.2.1. Lists and Tuples
---

When we collect not one, not two, but an entire corpus of texts, we're not going to want to create a new variable for each one. Imagine creating a new variable for each tweet Elon Musk sends out for our text analyses... frightening.

Fortunately, there are a number of different sequence types in Python where we can create one variable and just tell Python to add each new one to the end of the sequence and assign it a number. We're going to talk about two of these: tuples and lists.

There are several differences between tuples and lists, and if you get really into Python it's worth understanding them... but for our purposes here, we're concerned with two:

1. Lists are offset by square brackets `[a,b,c]` whereas tuples are offset by parentheses `(a,b,c)`.
2. Lists can be changed after they are created; tuples cannot (though the variable holding one can be overwritten with a new tuple). That is: lists are 'mutable' and tuples are 'immutable'.

Other than that, most of the things we want to do with one, we can also do with the other.
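
To make the mutable/immutable difference concrete, here is a minimal sketch. It uses square-bracket indexing (explained in 2.1.1.1 and 2.2.1.2 below) and `try`/`except` to catch the error, so don't worry about the mechanics yet. What matters is that the list accepts the change and the tuple refuses it.

```python
my_list = [1, 2, 3]
my_tuple = (1, 2, 3)

# Lists are mutable: an element can be replaced in place
my_list[0] = 99
print(f"Changed list: {my_list}")

# Tuples are immutable: the same operation raises a TypeError
try:
    my_tuple[0] = 99
    tuple_changed = True
except TypeError as error:
    tuple_changed = False
    print(f"Python refused to change the tuple: {error}")

# 'Overwriting' a tuple means pointing the variable at a brand-new tuple
my_tuple = (99, 2, 3)
print(f"Overwritten tuple: {my_tuple}")
```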

Let's get started by looking at and tinkering with some lists and tuples:

In [None]:
my_tuple = (1, 2, 3) # Creating a tuple (note the parentheses offset) and assigning it to the variable 'my_tuple'
print(my_tuple)
print(f"Variable: my_tuple is of type {type(my_tuple)}")

# let's add to it
extended_tuple = my_tuple + (4, 5, 6)
print(f"\nI added to the tuple, and it now contains: {extended_tuple}")

# Tuples and lists can contain more than just numbers
text_tuple = ("Text 1", "Text 2", "Text 3")
print(f"\nThe text tuple I created contains: {text_tuple}")

In [None]:
#Let's do the same thing, but with a list
my_list = [1, 2, 3] # Creating a list (note the square bracket offset) and assigning it to the variable 'my_list'
print(f"The list contains: {my_list}")
print(f"Variable: my_list is of type {type(my_list)}")

extended_list = my_list + [4, 5, 6]
print(f"\nI added to the list, and it now contains: {extended_list}")

text_list = ["Text 1", "Text 2", "Text 3"]
print(f"\nThe text list I created contains: {text_list}")

#### 2.2.1.1. How Python Counts
---

As humans, we tend to start counting with the number 1 because we're used to counting discrete objects. 

Most computer programming languages start counting with the number 0 because they're pointing to places in the computer's memory (and the first memory address in binary is 0000 0000... zero).

So why talk about this now? Aren't we talking about lists and tuples? What does any of this have to do with lists and tuples? 

Well, unless every time you work with a list/tuple you want to reference the **ENTIRE SEQUENCE** (not just one of the items it contains), you need to understand how Python assigns indices to each element so you can point to the one you want to work with.

So let's look at a list of the first 5 'abc's to keep it simple.

In [None]:
abc_first_five = ['a', 'b', 'c', 'd', 'e']

If Python starts counting from zero, that means that:
* a = 0

* b = 1

* c = 2

* d = 3

* e = 4

Let's verify this by using the `index()` method of sequences. It returns the index number of the specified value in the list.

So `abc_first_five.index('c')` will find 'c' in the list and return to us the first location of 'c' in the list (i.e., its index).

In [None]:
print(f"The index of a in abc_first_five is {abc_first_five.index('a')}")
print(f"The index of b in abc_first_five is {abc_first_five.index('b')}")
print(f"The index of c in abc_first_five is {abc_first_five.index('c')}")
print(f"The index of d in abc_first_five is {abc_first_five.index('d')}")
print(f"The index of e in abc_first_five is {abc_first_five.index('e')}")

OK, so given the contents of a list, we can now get the index. I guess that's handy... but usually we're looking to go the other way around. I know what the index I want is, and I want Python to give me the contents of that index in the list.

To get items from sequences in Python we use square brackets following the sequence. For instance, if I want the item at index 2 from our list, I might write:

In [None]:
print(f"The element at index 2 in abc_first_five is \"{abc_first_five[2]}\"")

At some point (for me, many points) in your use of Python you will decide "I want to print the nth item of the list out" and pull something like this:

In [None]:
print(f"The first letter of the alphabet is \"{abc_first_five[1]}\"")

Do you see the problem? When I was saying first I was thinking *as a human would count* not *as a computer would count*.

To make this make sense, I have to remember what part of that statement is intended for the computer, and what part of that statement is intended for the user of the computer.

In this case, my saying "The first letter" is intended for the user, so I'll leave that as is because the user is also human and starts counting alphabet letters starting with one as well.

On the other hand, the `abc_first_five[1]` bit is intended for the computer. Computers start counting from zero, so if I want the first item in the list, I need to tell it to provide the element at index **zero**!

Let's try it:

In [None]:
print(f"The first letter of the alphabet is \"{abc_first_five[0]}\"")

That's better! By the way, this works exactly the same with tuples as well. We're just using lists right now for simplicity.

Again, unless you're superhuman, this mistake is likely to happen to you at some point... if Python ever gives you an item adjacent to what you think it should have given you, think to yourself, 'did I count from zero?'
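
Here is a quick sketch to back up the claim that tuples index the same way, along with two behaviors worth knowing about: `index()` reports only the first match, and asking for an index past the end raises an error. `abc_tuple` and `repeated` are made-up names for this example.

```python
abc_tuple = ('a', 'b', 'c', 'd', 'e')   # same letters as abc_first_five, stored in a tuple

first_letter = abc_tuple[0]             # index 0 is the first element
position_of_c = abc_tuple.index('c')
print(f"First letter: {first_letter} | Index of c: {position_of_c}")

# When a value appears more than once, index() reports only the first match
repeated = ['a', 'b', 'a']
first_a = repeated.index('a')
print(f"The first 'a' in {repeated} is at index {first_a}")

# Five elements means the valid indices are 0 through 4; index 5 raises an IndexError
try:
    abc_tuple[5]
    out_of_range = False
except IndexError:
    out_of_range = True
print(f"Was index 5 out of range? {out_of_range}")
```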

#### 2.2.1.2. Working with Individual Elements
---

Now that we know how Python counts and how to access a single element from an index number, let's play around with it a little more...

In [None]:
my_tweet = "It is a beautiful day in Bloomington! A day like this is one in a million." # One text as a string

sentence_list = my_tweet.split("!") # Split the string into a list of sentences delimited by exclamation points
print(f"The sentences are: {sentence_list}")
print(f"it is of type {type(sentence_list)}")

print(f"\nsentence_list[0] (the first sentence) is \"{sentence_list[0]}\" and it is of type {type(sentence_list[0])}")
print(f"sentence_list[1] (the second sentence) is \"{sentence_list[1]}\" and it is of type {type(sentence_list[1])}")

word_list = sentence_list[0].split(" ") # Split the first sentence (i.e., at index 0) into a list of words delimited by spaces
print(f"\n\nThe words are: {word_list}")

Let's unpack this a little bit. We're starting to see elements of text analysis emerge a little more in this code. 

In text analysis we generally aren't going to be working with the text as one monolithic string. We want to break them down into smaller units: paragraphs/sentences/words. Here we use the `split()` method of strings from 2.1.3.2. to convert the original tweet (string) into a list of sentences by splitting on the exclamation point.

We then see that the result is a list that contains two strings. Look at the two sentence-level print statements in the middle. Notice how I'm using the `type()` function, but passing it not only the variable name, but also the index? This tells Python that I want to know what type of data is in that index of the list. That's why it returns `<class 'str'>`... the sentence stored there is a string.

I then do something similar when I create the word_list variable. Notice that I use the `split()` method of sentence_list[0], not just sentence_list. That's because `split()` is a method for strings, not lists. sentence_list is a list... but index 0 of sentence_list contains a string: the first sentence of the tweet. That is why splitting further by spaces takes that first sentence and breaks it into a list of words.

The key here for all of this is: when you add an index to the list you're working with (e.g., sentence_list[0]), it returns the object at that index in the list and you can treat it like it's not even in a list at that point.
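
Because an indexed element behaves like any other object of its type, indexes and methods can be chained one after another. A minimal sketch using the same tweet (the new variable names are just for illustration):

```python
my_tweet = "It is a beautiful day in Bloomington! A day like this is one in a million."
sentence_list = my_tweet.split("!")

# Index into the list, split the resulting string, then index into the new list
fourth_word = sentence_list[0].split(" ")[3]

# Strings are sequences too, so an index on a string returns a single character
first_character = sentence_list[0][0]

# String methods work directly on an indexed element
shouted = sentence_list[0].upper()

print(f"Fourth word: {fourth_word}")
print(f"First character: {first_character}")
print(f"Shouted: {shouted}")
```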

So what about changing the contents of lists? That can be done with indexing too:

Remember how we assigned values to variables using `=`? Well we'll do the same here. The difference is, we just need to tell it what element in the list to change:

In [None]:
print(f"Original: {word_list}")

# Let's change the 4th word (index 3) to 'gorgeous' instead of beautiful:
word_list[3] = "gorgeous"
print(f"\nReplaced: {word_list}")

In [None]:
print(f"Original: {word_list}")

# We also mentioned before wanting to do some preprocessing of words before doing text analyses.
# 'It' has capital letters... let's fix that!
word_list[0] = word_list[0].lower()
print(f"\nWith \"It\" set to lowercase: {word_list}")

# So does 'Bloomington'... let's fix that too!
word_list[-1] = word_list[-1].lower()
print(f"\nWith \"Bloomington\" set to lowercase: {word_list}")

Wait! What happened there? Shenanigans! In that last operation I used an index of -1. If Python starts counting at 0, what on earth would a list index of -1 mean? That seems out-of-bounds.

When used as a list index, negative numbers mean to start counting from the **end** of the list and move backwards. So that means:
* index -1 is the last item in the list: "Bloomington"
* index -2 is the 2nd to last item in the list: "in"
* index -3 is the 3rd to last item in the list: "day"... and so on.

And before you call me up about it, I know, it's a little weird that -1 is the first item when counting backwards and 1 is the second item when counting forwards. But don't blame me, blame math for not giving us a negative zero to use when counting backwards... remember positive zero is already taken...
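
One way to make sense of negative indexes is that index `-n` points at the same item as index `len(the_list) - n`. Here is a minimal sketch of that idea, using a made-up list of words (`example_words` is a hypothetical name, not a variable from the cells above):

```python
# A hypothetical list of words, used only for illustration
example_words = ["it", "is", "a", "beautiful", "day", "in", "bloomington"]

last_index = len(example_words) - 1   # 7 items, so the last index is 6

# Each comparison checks that the negative index and the positive index agree
print(example_words[-1] == example_words[last_index])
print(example_words[-2] == example_words[len(example_words) - 2])
print(example_words[-len(example_words)] == example_words[0])
```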

From the introduction demonstration, we've already seen that you can combine two lists by adding one list to another list:

In [None]:
["a","b","c"] + [1,2,3] + ["do", "re", "mi"]

There is a *method* for this that saves the resulting list back into the first as well: `extend()`

In [None]:
abcs = ["a", "b", "c", "d", "e"]
fghs = ["f", "g", "h", "i", "j"]
abcs.extend(fghs)
print(f"The extended list is: {abcs}")

But sometimes we want to add/insert/remove elements one at a time. The methods for these are: `append()`, `insert()`, `pop()`

In [None]:
print(f"Original: {abcs}")

abcs.append("k") # Add 'k' to the end of the list
print(f"\nWith k added to the end: {abcs}")

abcs.insert(0, "z") # Add 'z' to index 0 (remember: as the computer counts) of the list, moving everything else right one spot
print(f"\nWith z inserted at slot 0: {abcs}")

abcs.pop(0) # Pops off (removes) the item in index 0 off of the list (the redundant z)
print(f"\nWith z removed at slot 0: {abcs}")

There is also a `remove()` method. Here instead of telling it what index to remove, you tell it what *data* to remove and it'll remove the first instance of that data.

In [None]:
print(f"Original: {abcs}")

abcs.append("k") # Add 'k' to the end of the list
print(f"\nUh oh, now we have too many k's: {abcs}")

abcs.remove("k") # Find the first 'k' in the list and remove it
print(f"\nWith an appropriate number of k's: {abcs}")

You know how we've been changing things individually in these lists? Yeah, can't do that with tuples.

In [None]:
best_fifth_element_character = ("Jean-Baptiste", "Immanuel", "Zorg")
print(f"The best fifth element character is {best_fifth_element_character}")
print(f"He works at {best_fifth_element_character[2]} Industries")
best_fifth_element_character[2] = "Rhod" # This line raises a TypeError: tuples cannot be changed in place

That's what it means for Tuples to be immutable. Remember that one of the two differences (that we care about right now) between tuples and lists is that tuples are immutable and lists are mutable.
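
To see the difference side by side without stopping the notebook, the sketch below wraps the tuple assignment in `try`/`except`. That construct has not been introduced in this module, so treat it here simply as a way to catch the error and print it. The variable names are made up for illustration.

```python
name_as_list = ["Jean-Baptiste", "Immanuel", "Zorg"]
name_as_tuple = ("Jean-Baptiste", "Immanuel", "Zorg")

# Lists are mutable, so assigning to an index works
name_as_list[2] = "Rhod"
print(f"List after the change: {name_as_list}")

# Tuples are immutable, so the same assignment raises a TypeError
try:
    name_as_tuple[2] = "Rhod"
except TypeError as error:
    print(f"Python refused to change the tuple: {error}")

print(f"Tuple is unchanged: {name_as_tuple}")
```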

If you want to change a tuple, you have to overwrite the *entire* thing.

In [None]:
print(f"Original: {best_fifth_element_character}")

best_fifth_element_character = ("Ruby", "Rhod")
print(f"Overwritten: {best_fifth_element_character}")

#### 2.2.1.3. Working With Multiple Elements
---

We've looked at how to look at an entire list and how to get an individual item from the list. But what about how to get a small section of the list?

That's a pretty simple extension of getting a single item. We still use the square brackets, but now we provide a 'slice' in the brackets.

A slice has three parts:

1.   A starting point
2.   A stopping point, and
3.   A step (interval)

For instance, this slice: `0:5:1` tells Python to:
1.   Start with index 0
2.   Stop right before index 5, and 
3.   Grab every single one in that range (1)

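If you leave one of these numbers blank it assumes you want the default:

1.  Default starting point: 0 (first element)
2.  Default ending point: the end of the list (so the last element is included)
3.  Default step: 1 (all items)

These defaults assume a positive step. With a negative step the defaults flip, so the slice starts at the end of the list and runs back through the first element.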

Let's try it:

In [None]:
abc_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

abc_subset = abc_list[0:5:1] # grabs the first five elements in the list (indexes 0 through 4)
print(f"First five: {abc_subset}")

abc_subset = abc_list[5:10:1] # grabs the second five elements in the list (indexes 5 through 9)
print(f"\nSecond five: {abc_subset}")

abc_subset = abc_list[::] # nothing entered = use the defaults, start at beginning, end at end, step through each one
print(f"\nFull list: {abc_subset}")


It's important to remember that the second item in a slice tells it which index to stop BEFORE. In other words, it doesn't include that item.
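
A small sketch of that rule, using a short made-up list (`letters` is a hypothetical name). It also shows that leaving the stop blank is not the same as writing `-1` for it:

```python
letters = ["a", "b", "c", "d", "e", "f"]

# The stop index is excluded, so the slice length is stop - start
print(letters[1:4])        # indexes 1, 2, and 3
print(len(letters[1:4]))   # 4 - 1 = 3 items

# Leaving the stop blank runs through the end of the list (last item included)
print(letters[3:])

# Writing -1 as the stop is different: it stops right before the last item
print(letters[3:-1])
```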

Let's now play around with the step (interval) a bit. 1 grabs every item in the list between the start and stop point. Let's look at 2, 3, -1, and -2!

In [None]:
abc_subset = abc_list[::2]
print(f"Every other letter: {abc_subset}")

abc_subset = abc_list[1::2]
print(f"\nEvery other letter starting with index 1: {abc_subset}")

abc_subset = abc_list[::3]
print(f"\nEvery third letter: {abc_subset}")

abc_subset = abc_list[::-1]
print(f"\nEvery letter, but step backwards through the list: {abc_subset}")

abc_subset = abc_list[24::-2]
print(f"\nEvery other letter, but step backwards through the list and start with index 24: {abc_subset}")

Note that when moving backwards through the list (i.e., when step is negative), the 'start' and 'stop' have to reflect that.

In other words the following slice makes sense:
`[5:1:-1]`
because we're starting at index 5 and moving backwards to just before index 1.

On the other hand `[1:5:-1]` will return an empty list because you cannot get from index 1 to index 5 by moving backwards.

Let's try it:

In [None]:
abc_subset = abc_list[5:1:-1]
print(f"This works because we're starting at 5 and moving backwards to just before 1: {abc_subset}")

abc_subset = abc_list[1:5:-1]
print(f"\nThis won't work because you can't get from 1 to 5 by moving backwards: {abc_subset}")

---
**Slicing Strings**

Remember how we said that strings are "a series of characters all strung together inside quotes"... we can actually use slicing with strings as well to access parts of that 'series'.

Let's try it:

In [None]:
word_42 = "ambidexterity"

print(f"The prefix for the word is: {word_42[0:4]}")
print(f"The root of the word is: {word_42[4:]}")
print(f"\nJust for fun...\nThe word spelled backwards is: {word_42[::-1]}")
print(f"Every other letter in the word is: {word_42[::2]}")

#### 2.2.1.4. Useful Functions/ Methods
---

One thing we liked about strings is we could count the number of times a substring appeared in the string using the `count()` method and the length of the string using the `len()` function. We can do that with both tuples and lists too! 

In [None]:
sentence_as_a_tuple = ("your", "jokes", "are", "funny", "ha", "ha", "ha", "very", "funny")
sentence_as_a_list = ["your", "jokes", "are", "funny", "ha", "ha", "ha", "very", "funny"]

print(f"Aaron laughed {sentence_as_a_tuple.count('ha')} times in the tuple")
print(f"Aaron laughed {sentence_as_a_list.count('ha')} times in the list")

print(f"The list has {len(sentence_as_a_list)} words in it.")

Other useful *methods* such as `reverse()`, `sort()`, and `clear()` are only available for lists because they change the list in place (and tuples are immutable).

There are *functions* such as `sorted()` and `reversed()` that can replicate this functionality in tuples, but those go beyond what I'm going to cover here.
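
For the curious, here is a minimal sketch of those two functions on a tuple (the tuple contents are made up). Neither one changes the original: `sorted()` returns a new list and `reversed()` returns an iterator, so both are wrapped in `tuple()` to get tuples back.

```python
scores = (3, 1, 2)

sorted_scores = tuple(sorted(scores))       # sorted() returns a new list
reversed_scores = tuple(reversed(scores))   # reversed() returns an iterator

print(f"Original (unchanged): {scores}")
print(f"Sorted copy: {sorted_scores}")
print(f"Reversed copy: {reversed_scores}")
```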

In [None]:
keyboard_letters = ['q','w','e','r','t','y','u','i','o','p','a','s','d','f','g','h','j','k','l','z','x','c','v','b','n','m']
print(f"Original: {keyboard_letters}")

keyboard_letters.sort()
print(f"\nSorted: {keyboard_letters}")

keyboard_letters.reverse()
print(f"\nReversed: {keyboard_letters}")

keyboard_letters.clear()
print(f"\nCleared: {keyboard_letters}")

You've probably noticed that when we have printed out lists, it still very much looks like a list - even when that list is a list of words.

Consider:

In [None]:
sentence_list = ["This", "sentence", "is", "actually", "a", "list."]
print(f"The sentence is: \"{sentence_list}\"")

The fact that we were able to turn that string into a list of words is good in most cases. 

But sometimes we'd like to turn it back into a string. For instance, when we output the sentence, it'd be nice for it to actually look like a sentence rather than a list.

However, doing so requires more than just using the `str()` function like we've done in the past. Let's see what `str()` does for us here:

In [None]:
print(f"sentence_list is of data type: {type(sentence_list)} and contains: \"{sentence_list}\"")

# Let's tell Python to make a string out of it
sentence_string = str(sentence_list)
print(f"\nsentence_string is of data type: {type(sentence_string)} and contains: \"{sentence_string}\"")

The problem is, Python doesn't know that the list contains a sentence... or what a sentence is for that matter. For all it knows these are all different chunks of a single word...

Given this, how would it decide 'how' to combine the list elements into one string? It doesn't know, and so when you cast it as a string, it just puts quotes around the entire list and calls it a day.

If we want it to actually integrate the list elements, we're going to have to tell it how. The way we do this is with a *string method* called `join()`.

Join basically tells Python to join the list elements together with the string you provide. Let's try it!

In [None]:
print(f"sentence_list is of data type: {type(sentence_list)} and contains: \"{sentence_list}\"")

# We want to use a space to join the elements together so we use the " " string's join method
sentence_string = " ".join(sentence_list)
print(f"\nsentence_string is of data type: {type(sentence_string)} and contains: \"{sentence_string}\"")

# If we wanted to replicate actual spoken text, we could join using a vocal distractor
sentence_string = " uh... ".join(sentence_list)
print(f"\nsentence_string is of data type: {type(sentence_string)} and contains: \"{sentence_string}\"")

### 2.2.2. Dictionaries
---

When you think of a traditional dictionary, you think of starting with a word, and using that as a 'key' to find a matching definition. In Python, dictionaries ('dict' in Python) serve very much the same purpose. One piece of data is a 'key' to give you access to another piece of data.

Let's start with an example:

You have three files:
1. Microsoft 2021 shareholder letter ("msft_21.docx")
2. Apple 2021 shareholder letter ("aapl_21.docx")
3. Alphabet 2021 shareholder letter ("googl_21.docx")

You load the contents of these shareholder letters into Python (and presumably many others, but we'll work with these three), but how do you keep track of them?

* We talked about what a nightmare it would be to try to have one variable per file... no way!

* We could store them in a tuple or a list, but then when we want to find one, how do we find Microsoft's? Do we have to create a second list just to keep track of what index number is associated with what text? Sounds inefficient.

* One simple way might be to use a dictionary and use the filename (or company name, if you prefer) as the key!

Let's see what this might look like:

In [None]:
shareholder_letters = {"msft_21.docx": "Microsoft's letter text would be here",
                       "aapl_21.docx": "Apple's letter text would be here",
                       "googl_21.docx": "Alphabet's letter text would be here"}

print("The dictionary has been created!")

Let's dissect the creation of that dictionary. It's a little more complicated than the other data types and structures we saw. 

Let's start with two things you can push aside:
* `shareholder_letters =` - this is exactly like every other variable assignment we've seen. Nothing different here
* I've broken up the dictionary into multiple lines. I've done this to make reading it easier and for no other reason. This would work equally well if it were all on one line (it'd just be much harder to read).

OK. Now let's look at what makes it a dictionary:
* Braces `{}` - instead of being offset by square brackets (as with lists) or parentheses (as with tuples), dictionaries are identified by braces
* Each entry within the braces is a `key: value` pair and ends with a `,` (except for the last one). Unlike lists and tuples where you only provide the value, with dictionaries Python needs you to tell it the key for that value as well. Imagine a traditional dictionary with definitions, but no words associated with those definitions. That would be infuriating! The same thing applies here.

So let's examine the code above then. The variable name is `shareholder_letters` and the key: value pairs are the name of the text and some placeholder text. We did this for three companies, but like lists and tuples, these can grow pretty large if you want them to.

Let's see how we can get Python to give us the data from this dictionary:

In [None]:
microsoft = shareholder_letters["msft_21.docx"]
print(f"Microsoft's shareholder letter said: \"{microsoft}\"")

print(f"Apple's shareholder letter said: \"{shareholder_letters['aapl_21.docx']}\"")

As with lists and tuples, dictionaries' data can be accessed using square brackets after the variable name as we see above. 

If we're very careful, then all is well. However, there is a potential hazard here. Try executing the code below:

In [None]:
ibm = shareholder_letters["ibm_21.docx"] # This line raises a KeyError: this key was never added to the dictionary

Just like that, our code grinds to a halt. We never added IBM's shareholder letter to the dictionary. So when we tried to access it, Python doesn't know what to do!

Sometimes we want this to happen: we may have typed something in wrong, and this warns us that there is a problem. However, other times we might just want it to return a default value, for instance `""` (an empty string). If we do not have the text for something, an empty string makes sense as a potential output.

To enable a default value when an invalid key is passed, we'll use the `get()` method:

In [None]:
microsoft = shareholder_letters.get("msft_21.docx", "")
print(f"Microsoft's shareholder letter said: \"{microsoft}\"")

apple = shareholder_letters.get("aapl_21.docx", "")
print(f"Apple's shareholder letter said: \"{apple}\"")

ibm = shareholder_letters.get("ibm_21.docx", "")
print(f"IBM's shareholder letter said: \"{ibm}\"")

Again, this may not be what we want... if this is going to trick you into believing that IBM issued a shareholder letter, but that it was completely empty, this may be a bad idea. But in other cases, this might make sense.
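
If that ambiguity is a concern, one option (sketched below with a stand-in dictionary, `letters_by_file`, that is not part of the cells above) is to leave out the default, in which case `get()` returns `None` for a missing key. Another is to check for the key with `in` before asking for it.

```python
# Stand-in dictionary for illustration; one letter is genuinely empty
letters_by_file = {"msft_21.docx": "Microsoft's letter text would be here",
                   "empty_21.docx": ""}

# With "" as the default, a missing file and an empty file look identical
looks_identical = letters_by_file.get("ibm_21.docx", "") == letters_by_file.get("empty_21.docx", "")
print(f"Do the missing and empty letters look the same? {looks_identical}")

# Without a default, get() returns None for a missing key
missing_letter = letters_by_file.get("ibm_21.docx")
print(f"Missing key gives: {missing_letter}")

# The 'in' operator checks for the key directly
print(f"Is IBM's letter in the dictionary? {'ibm_21.docx' in letters_by_file}")
print(f"Is the empty letter in the dictionary? {'empty_21.docx' in letters_by_file}")
```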

Another thing to note about dictionaries:
* The **keys** must be immutable. For example, you could use a tuple, but not a list, as a key for the dictionary.
* The values can be any data type. 

Let's try that:

In [None]:
# Using other types as values
ibm_information = {"num_employees": 345900,
                   "ceo_name": "Arvind Krishna",
                   "board_members": ["Thomas Buberl", "Michael Eskew", "David Farr"]}

print(f"IBM is a company with over {ibm_information['num_employees']} employees and is led by {ibm_information['ceo_name']}.")

In [None]:
# Using other types as keys
tv_movie_references = {42: "The Hitchhiker's Guide to the Galaxy",
                       "spam": "Monty Python's Meaning of Life",
                       (4, 8, 15, 16, 23, 42): "Lost"}

print(f"The number 42 was important in {tv_movie_references[42]}.")
print(f"The tuple (4, 8, 15, 16, 23, 42) was important in {tv_movie_references[(4, 8, 15, 16, 23, 42)]}.")

In [None]:
# But you can't use lists as keys because they're mutable!
this_will_fail = {["Not", "Gonna", "Work"]: "See, I told ya so!"}

Like with the other data types and structures, you *can* use the `dict()` function to build a dict from some other data type. 

However, in my experience I've only ever used it maybe three times. So we'll just leave it at 'that technically exists' and move on. You're not likely to need it much to do text analysis.

#### 2.2.2.1. Working with Individual Entries
---

We've seen that you can access the value of a dictionary entry by using square brackets after the variable name. This is also one way of adding or changing an entry in a dictionary... just add the `=` assignment operator and a value!

In [None]:
ibm_information = {"num_employees": 345900,
                   "ceo_name": "Arvind Krishna",
                   "board_members": ["Thomas Buberl", "Michael Eskew", "David Farr"]}

# list() converts dictionaries into a list of their keys!
print(f"Keys in the dictionary before addition: {list(ibm_information)}")
# len() tells you how many keys are in it
print(f"The dictionary currently has {len(ibm_information)} keys.")

# Add the 2020 revenue key, but we don't know what the value is yet. So we'll leave it at 0
ibm_information["2020_revenue"] = 0
print(f"\nKeys in the dictionary after addition: {list(ibm_information)}")
print(f"The dictionary currently has {len(ibm_information)} keys.")

ibm_information["2020_revenue"] = 73600000000
print(f"\nIBM made ${ibm_information['2020_revenue']} in revenue for FY 2020.")

If you want to remove an entry from the dictionary, you use the 'del' *statement*.

In [None]:
ibm_information = {"num_employees": 345900,
                   "ceo_name": "Arvind Krishna",
                   "board_members": ["Thomas Buberl", "Michael Eskew", "David Farr"],
                   "2020_revenue": 73600000000}
print(f"Keys before deletion: {list(ibm_information)}")
print(f"The dictionary currently has {len(ibm_information)} keys.")

del ibm_information["2020_revenue"]

print(f"\nKeys after deletion: {list(ibm_information)}")
print(f"The dictionary currently has {len(ibm_information)} keys.")

You can also check to see whether a key is in the dictionary using the 'in' / 'not in' *binary operators*:

In [None]:
print(f"Does the ibm_information variable have the number of employees in it? {'num_employees' in ibm_information}")
print(f"Did we already delete IBM's 2020 revenue from the dictionary? {'2020_revenue' not in ibm_information}")

#### 2.2.2.2. Working With Multiple Entries
---

We don't often work with multiple entries at a time using dictionaries in the same way as we do with lists (e.g., slicing). However, one time that we do so is to update one dictionary's keys and values using another dictionary's keys and values.

To do this, we use the dictionary **to-be-updated**'s `update()` method and pass it the **updating** dictionary as data. In other words:

`to_be_updated_dict.update(doing_the_updating_dict)`

Let's give it a shot:

In [None]:
about_apple = {"employees": 137000,
               "revenue": 260174000000,
               "ceo_name": "Tim Cook",
               "ticker": "AAPL",
               "last_updated": "December 31, 2019"}
print(f"As of the last update on {about_apple['last_updated']}, Apple had {about_apple['employees']} employees, made ${about_apple['revenue']} over the past year, and was led by CEO {about_apple['ceo_name']}.\n")



apple_updates = {"employees": 147000,
                 "revenue": 274515000000,
                 "last_updated": "December 31, 2020"}

# We'll update the about_apple dictionary with the updates stored in apple_updates
about_apple.update(apple_updates)
print(f"As of the last update on {about_apple['last_updated']}, Apple had {about_apple['employees']} employees, made ${about_apple['revenue']} over the past year, and was led by CEO {about_apple['ceo_name']}.\n")

We may not frequently work with subsets of the dictionary entries like we do in lists and tuples, but we'll actually find that we want Python to tell us ALL of the dictionary keys and values on a pretty regular basis. For this we'll use three dictionary methods:

1.   `keys()` - tell us all the **keys** for which there are entries in the dictionary (similar to using `list()` on the dictionary)
2.   `values()` - tell us the **values** for which there are entries in the dictionary
3.   `items()` - tell us all the **key-value combinations** for which there are entries in the dictionary

Let's take a look at each of these in turn.

In [None]:
# Two ways of looking at the keys:
print(f"Using the list() function: {list(about_apple)}")

print(f"\nUsing the keys() method: {about_apple.keys()}")

In [None]:
# Looking at the values:
print(f"The values in the dictionary are: {about_apple.values()}")

In [None]:
# Looking at the items:
print(f"The items (entries) in the dictionary are: {about_apple.items()}")

One thing to note here: the output from the three new methods isn't just a typical list like `[key, key, key]`. It's a list wrapped inside something else, such as `dict_keys([key, key, key])`.

This is because the output from these methods is NOT *actually* a list... it's a special data type (a dictionary 'view') that we're not going to dive into because we're not really going to use it much.

However, we can *turn these into lists* so we can treat them like lists very easily...

In [None]:
apple_dict_values = about_apple.values()
print(f"The values in the dictionary are: {apple_dict_values}")
print(f"The data type of this is {type(apple_dict_values)}")

print(f"\nIs that the same as being a list? {isinstance(apple_dict_values, list)}")

# Let's turn it into a list then!
apple_dict_values = list(apple_dict_values)
print(f"\nIs it a list now? {isinstance(apple_dict_values, list)}")

print(f"\nThe new values in the dictionary are: {apple_dict_values}")
print(f"If this is a list now, I should be able to take a slice of it: {apple_dict_values[0:2]}")

The reason I did this is that these special objects don't like being treated like lists. If you try to take a slice of `about_apple.values()` directly, without converting it to a list first, Python will kick back an error.
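
As a minimal sketch of why the conversion to a list matters (using a small made-up dictionary, `company`), the object returned by `values()` supports `len()` but not slicing, while the list made from it supports both. The `try`/`except` is only there to catch the error and print it.

```python
company = {"ticker": "AAPL", "ceo_name": "Tim Cook", "employees": 147000}

values_view = company.values()
print(f"The view has {len(values_view)} values")

try:
    values_view[0:2]   # views cannot be indexed or sliced
except TypeError as error:
    print(f"Slicing the view failed: {error}")

values_list = list(values_view)
print(f"Slicing the list works: {values_list[0:2]}")
```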

## 2.3. Working with Modules
---

Technically, if we wanted to write a **LOT** of code, we could probably build everything from scratch and not use code created by anyone else. However, imagine doing something complex like running a regression model or structural equations model from scratch! It's inefficient.

There's a popular belief that computer programming is a bunch of geniuses writing code from scratch. In reality, most programmers - like academics - *stand on the shoulders of giants* and reuse other people's code *all the time*! 

This practice is so common, in fact, that there are over 200 'modules' that come by default with Python and *many thousands* more that you can access with two lines of code. Each of these modules may include many different functions and data types like those we've been using so far.

In this section, we're going to briefly look at how to access these modules.

### 2.3.1. Importing Modules and Using Their Components
---

For the time being, we're going to take for granted that you have installed the modules you want to work with. We'll revisit this assumption in the next section when we'll discuss installing modules.

The first example we're going to work with here is a module called `pprint`, which stands for 'pretty print'.

You probably noticed in the last sections when we were printing out lists, tuples, and dicts that the output was fairly ugly and difficult to read. The pprint module seeks to make printing these out a bit easier to read.

The most basic way to import a module into your code is to... well... `import` it:

In [None]:
import pprint

print("The pprint module is now available for us to use.")

When you use `import` like this, any time you want to use one of its components, you need to include the module name when you call it.

For instance, one of the main functions in the pprint module is actually `pprint()`. Let's use it.

*I know, the function has the same name as the module, a bit confusing, but this isn't always the case*

In [None]:
print("Here's the generic print version:")
print(about_apple)

print("\nHere's the pretty print version, with each key alphabetized and nicely aligned:")
pprint.pprint(about_apple)

Notice how we had 'pprint' twice, separated by a '.'?

The 'pprint' on the left tells Python to look in the 'pprint' module for the function.

The 'pprint()' on the right tells Python what function to look for in the module.

Putting it together: `pprint.pprint(about_apple)` tells Python to send the dictionary 'about_apple' to the pprint function in the pprint module.

To drive this home, the pprint module also has a `pformat()` function that instead of pretty printing to the screen returns the pretty formatting so that it can be stored in a variable for later.

Let's give that a whirl:

In [None]:
pretty_apple = pprint.pformat(about_apple)
print(pretty_apple)

Cool, let's try another, just for practice. Let's import the `math` module.

Python has lots of basic math operations available just by default. But once you get beyond the basics, you will need to import modules to get more 'advanced' math functions and variables.

For instance, if you want the variable `pi` or the natural log function (`log()`), those can be obtained in the math module.

Let's access them:

In [None]:
import math

# Let's print out the value of pi
print(f"The value of pi is {math.pi}")

# Calculate and print the area of a circle with a radius of 4:
radius = 4
area = math.pi * radius ** 2
print(f"The area of a circle with radius {radius} is {area}")

In [None]:
# Using the natural log function of the math module with our value of Apple's revenue stored in the dictionary
apple_revenue_ln = math.log(about_apple["revenue"])

print(f"Apple's revenue is {about_apple['revenue']}")
print(f"The natural log of Apple's revenue is {apple_revenue_ln}")

In our research we often take the natural log of size variables to address skewness in the data - here we've demonstrated how to do that using `math.log()`.

We also see here that sometimes these modules contain variables in addition to functions. For example, here we accessed the `math.pi` variable.
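
`math.log()` returns the natural log when given a single number. As a small sketch using the revenue figure from above (re-typed here as `revenue` so the block runs on its own), `math.exp()` undoes the natural log, and `math.log10()` gives the base-10 log instead.

```python
import math

revenue = 274515000000   # Apple's revenue value from the dictionary above

revenue_ln = math.log(revenue)       # natural log (base e)
revenue_log10 = math.log10(revenue)  # base-10 log

print(f"Natural log: {revenue_ln}")
print(f"Base-10 log: {revenue_log10}")

# exp() reverses the natural log (up to tiny floating-point rounding)
print(f"Back to the original: {math.exp(revenue_ln)}")
print(f"Close enough to equal? {math.isclose(math.exp(revenue_ln), revenue)}")
```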

OK, so we've seen one method of importing a module... let's look at two other common ways that can simplify our lives further (in some cases).

The first method is to use `from ___ import ___` instead of `import ___`.

For example, we could do `from math import pi` or `from pprint import pformat`.

This is helpful if you know you're only going to use one or a few functions/variables from that module and you don't want to have to use the module name each time you use it/them.

Let's see an example:

In [None]:
# Because we use "import math" we have to specify that the pi variable came from that module
import math
area = math.pi * radius ** 2

# Because we told Python we specifically want to import the pi variable from math, we don't have to repeat that it's from the math module in the future.
from math import pi
area = pi * radius ** 2

# You can use commas if you want to bring more than one variable/function in from the module (and you don't need parentheses to bring in functions)
from math import pi, e, tau, log
print(f"The value of pi is: {pi}, the value of e is: {e}, the value of tau is: {tau}, the natural log of e is {log(e)}.")

A couple of things to note here:

*   You only need to import things once and you generally do so at the top of your program. Once you do so, Python will remember that import statement and go to that module whenever needed from then on. The only reason I'm doing so here is to make the distinctions easier to learn.
*   The `from ____ import ____` format is handy, but you have to remember that this is ONLY going to give you access to the variables and functions from that module that you specify. Whereas with `import ____` you'll have access to all of them (but have to use the module name to access them).
*   When you use `from ____ import ____` you're telling Python that you want those variables and functions to live in your program by those names. If you then, say, accidentally overwrite the 'pi' variable for something else in your program, the math module version of 'pi' is gone (temporarily). With `import ____` you won't have to worry about that because you're less likely to assign `math.pi` a different value (it's technically possible, but you're less likely to do that by accident).

Let's see an example of that last point...



In [None]:
from math import pi
print(f"The area of a circle with radius {radius} is {pi * radius ** 2}")

pi = "cherry pie "
print(f"The area of a circle with radius {radius} is {pi * radius ** 2}")

# Remember how a string multiplied by a number repeats the string that number of times?
# pi is now "cherry pie "
# radius ** 2 = 16
# so pi * radius ** 2 will now... well, go ahead and run the code.

When we use a module name a LOT or the module name we are importing is pretty long, sometimes we like to give it a short alias so that we don't have to type out the module name over and over again.

Oftentimes, there are 'norms' for the aliases we give these modules. For example:
* The `pandas` module (used to work with entire tables of information like we do in other statistical software) is normally aliased `pd`
* The `numpy` module (used to work with arrays and matrices) is normally aliased `np`
* The `matplotlib.pyplot` module (used to create charts) is normally aliased `plt`

You certainly don't *have* to follow these norms, but as you work with other people on your code and/or search Google for example code you can adapt to your purposes, not using the norm will create extra work and confusion. It's probably best to stick with the norms.

Now let's actually do it. To alias our imports we use `import ______ as ______`.

Let's import the numpy module two ways: with and without its alias, and use it to create an array using both approaches.

In [None]:
# Create a list of numbers from 0 through 999,999
list_of_numbers = list(range(0,1000000))

# Without the alias
import numpy
array_of_numbers = numpy.array(list_of_numbers)
print(f"The last 3 numbers in the 'numpy' version of the array are: {array_of_numbers[-1:-4:-1]} and the array type is {type(array_of_numbers)}")

# With the np alias
import numpy as np
array_of_numbers = np.array(list_of_numbers)
print(f"The last 3 numbers in the 'np' version of the array are: {array_of_numbers[-1:-4:-1]} and the array type is {type(array_of_numbers)}")

You can see that the result is the same behind the scenes - Python is truly just using 'np' as an alias for the 'numpy' module.
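
To check that claim directly, the sketch below uses the built-in `math` module (chosen only so it runs without installing anything; the alias `m` is made up and is not a community norm). The `is` operator, which hasn't been covered here, tests whether two names refer to the very same object.

```python
import math
import math as m

print(f"Are 'm' and 'math' the same module? {m is math}")
print(f"Do they give the same pi? {m.pi == math.pi}")
```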

### 2.3.2. Installing Packages
---

One of the nice things about Google Colaboratory is that most of the modules we want to use are already installed for us. So if you want to skip this section, you can and you won't miss anything important for this series.

However, if you decide to take what you learn in this series and try to replicate it on your local computer, you may find that you're missing some important modules. In this next section, we'll address how to get those modules ready for use.

#### 2.3.2.1. Installing Packages in Google Colaboratory
---

Google Colaboratory has a ton of 'extra' modules already installed in it. So it's pretty rare that you'll need to install a new one as part of this series. However, if you find that you do, it's pretty easy.

The code for installing a package (a package is a collection of modules) is:

`!pip install ______________`

So for instance, if we wanted to install a pretty stellar text analysis tool called 'stanza', we could run the code:

In [None]:
!pip install stanza

You generally don't install packages from within Python; you generally do it from the command prompt. However, in a notebook like this one you can have the command run for you by typing it after an exclamation point, as we do above.

In other words, `pip install stanza` is NOT Python code. The exclamation point tells the notebook to pass that command to the operating system to execute.

#### 2.3.2.2. Installing Packages on Your Local Machine
---


On your local machine you would install packages much the same way as you did above for Google Colaboratory. However, generally on your own computer you would do so from the command prompt rather than from within Python.

So you would go to the command prompt and type in `pip install stanza` in this case.

There are caveats to this, many of them in fact. For instance, the way you install packages can change based on:

*   Whether you're using a 3rd-party Python platform - [Anaconda](http://www.anaconda.org) is a common one for data science
*   Whether the package you're installing is stored in the main package repository (the Python Package Index, or PyPI)
*   Whether the package you're installing is in GitHub rather than a repository
*   Whether the module you want to use is just someone else's Python file with functions in it that they pass to you on a jump drive.

There are too many variations to address here. However, the answers are [out there](https://packaging.python.org/tutorials/installing-packages/).
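
Before installing anything on your own machine, it can help to ask Python whether it can already find a module. The sketch below is one way to do that using only the standard library. The helper name `is_installed` is my own invention for this example, not something that ships with Python. Keep in mind that the name you import is not always the name you install (for example, you `pip install scikit-learn` but `import sklearn`).

```python
from importlib.util import find_spec

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return find_spec(module_name) is not None

# 'collections' ships with Python; 'stanza' may or may not be on your machine
for name in ["collections", "stanza"]:
    if is_installed(name):
        print(f"{name}: ready to import")
    else:
        print(f"{name}: not found - try 'pip install {name}' from the command prompt")
```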

## 2.4. A Few 'Other' Data Structures
---

There are a **ton** of data structures we're not going to talk about. However, in this series, we're going to come across three that it's worth being familiar with because we'll be using them to show different aspects of text analysis.

These are:

* The `collections.Counter` object
* The NumPy Array object, and
* The Pandas DataFrame object.

Let's be clear, there are [entire books](https://www.manning.com/books/pandas-in-action) just on how to use pandas DataFrames. We're not even going to scratch the surface here.

Mostly, I want you to be aware of:

*   Their existence
*   What they are/do
*   One way of creating each (of many possible ways)

### 2.4.1. Collections Counters
---

The `collections` module comes with Python itself, so you'll never have to download it. This module contains a number of more specialized data structures than the lists/dictionaries/tuples we discussed above.

Here, we're only going to care about the `collections.Counter` data structure. I like to think about the Counter as a dictionary... but a dictionary with a more specialized purpose. 

Whereas in regular dictionaries, each key can have a value of any data type; in Counters, the value is an integer count for that key.

Let's look at an example, where we count the number of times individual names appear in a list.

In [None]:
import pprint
from collections import Counter

students = ["Angela", "Blake", "David", "Stephanie", "Angela", "Miles", "Kelly", "Emiko", "Ciara", "Stephanie", "Terri", "Darius", "Stephanie"]
student_counter = Counter(students)

pprint.pprint(student_counter)

We can see from the output that it's structured a *lot* like the dictionaries we saw previously. However, look at the values. We never assigned numbers to each of the keys explicitly.

This data structure has a counting mechanism built into it such that whenever the key is added, it updates the counter for that key.
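
To make that "built-in counting mechanism" concrete, here is a small sketch comparing a Counter with doing the same tally by hand in a regular dictionary. The shortened `students` list and the variable names are just for this illustration.

```python
from collections import Counter

students = ["Angela", "Blake", "Angela", "Stephanie", "Stephanie", "Stephanie"]

# The manual way: a regular dictionary plus a loop
manual_counts = {}
for name in students:
    # get() returns 0 the first time we see a name, then we add 1
    manual_counts[name] = manual_counts.get(name, 0) + 1

# The Counter way: one line
counter_counts = Counter(students)

print(manual_counts)
print(f"Same tallies either way? {dict(counter_counts) == manual_counts}")

# A name that never appeared has a count of 0 rather than raising a KeyError
print(f"Zelda appears {counter_counts['Zelda']} times")
```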

Let's see why this might be valuable in text analysis.


In [None]:
# One paragraph from the 2020 Microsoft shareholder letter
msft_shareholder_letter = "More than ever, organizations are relying on Azure to stay up and running and support critical workloads, from healthcare triage with AI-assisted bots, to digital twins in manufacturing, to e-commerce in retail. Today, leaders in every industry—including 95 percent of the Fortune 500—run on Azure. We are building Azure as the world’s computer to support them, with more datacenter regions than any other provider— now 61. Fifty billion devices will come online by 2030, and Azure is the only cloud that extends to the edge, with consistency across operating models, development models, and infrastructure stack. Azure Arc enables organizations to deploy Azure services anywhere and extend Azure management to any infrastructure. Azure Stack Edge brings rapid machine learning inferencing closer to where data is generated, including the harshest of conditions, like disaster response. Our acquisitions of Affirmed and Metaswitch, along with new Azure Edge Zones, expand our offerings for telecom operators as they move to 5G. And, with Azure Orbital, we’re taking our infrastructure to space, enabling anyone to access satellite data and capabilities from Azure."

# Break the paragraph into individual words and remove punctuation
word_list_msft = msft_shareholder_letter.replace(".", "").replace(",", "").replace("-"," ").replace("—"," ").replace("  ", " ").split(" ")

# Don't worry about how this line of code works quite yet, we'll be covering 'list comprehensions' in the next module and this line of code will become clear then.
# For now, just know that this line removes words with fewer than 5 letters from the list to keep our code output manageable
word_list_msft = [word for word in word_list_msft if len(word) >= 5]

# Create the counter
msft_word_counter = Counter(word_list_msft)

# Print it!
pprint.pprint(msft_word_counter)

One of the simplest forms of text analysis is word counting. But simple though it may be, word counting can tell us valuable information about the text author. 

In this case we see the word 'Azure' appears 11 times in the text. This may suggest something about Microsoft's priorities/attention in 2020-2021.

By using the Counter data structure, in one line of code we went from having a list of words to a count of how many times each word appears in the text. Convenient!
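
Once you have a Counter, you usually want the biggest counts first. A minimal sketch of that is below, using the Counter's `most_common()` method on a short sentence I made up for the example (it is not from the shareholder letter).

```python
from collections import Counter

# A short, made-up string used only for illustration
sample_text = "azure cloud azure edge cloud azure"
sample_counter = Counter(sample_text.split(" "))

# most_common(n) returns a list of (word, count) tuples, highest counts first
top_two = sample_counter.most_common(2)
print(top_two)  # [('azure', 3), ('cloud', 2)]
```

The same call on `msft_word_counter` would give you the most frequent words in the paragraph without having to scan the whole printout.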

---

While we're at it, let's talk about that second line of code:

`word_list_msft = msft_shareholder_letter.replace(".", "").replace(",", "").replace("-"," ").replace("—"," ").replace("  ", " ").split(" ")`

That looks like a monster, but if we pick it apart piece by piece, we'll see that it's all basic things we've seen before. 

* `word_list_msft = ` - Take whatever is on the right of the equals sign and assign it to the variable word_list_msft.

* `msft_shareholder_letter.replace(".", "")` - Find all the periods in msft_shareholder_letter and replace them with nothing (i.e., remove them)

* `.replace(",", "")` - Find all the commas in the string and replace them with nothing (i.e., remove them)

* `.replace("-"," ")` - Find all hyphens in the string and replace them with spaces

* `.replace("—"," ")` - Find all long dashes in the string and replace them with spaces

* `.replace("  ", " ")` - Find all double spaces in the string and replace them with single spaces

* `.split(" ")` - Split the string up into a list of individual strings based on where you find spaces. In other words, create a list of the words in the string.

This is called *method chaining*. It is basically the same as doing the below, but has the benefit of being quite a bit easier to read, and you don't have to create an extra variable to hold the intermediate results.

In [None]:
temporary_storage = msft_shareholder_letter.replace(".", "") # Find all the periods in msft_shareholder_letter and remove them
temporary_storage = temporary_storage.replace(",", "") # Find all the commas in the string and remove them
temporary_storage = temporary_storage.replace("-"," ") # Find all hyphens in the string and replace them with spaces
temporary_storage = temporary_storage.replace("—"," ") # Find all long dashes in the string and replace them with spaces
temporary_storage = temporary_storage.replace("  ", " ") # Find all double spaces in the string and replace them with single spaces
word_list_msft = temporary_storage.split(" ") # Split the string up into a list of individual strings based on where you find spaces. In other words, create a list of the words in the string.

del temporary_storage # Delete the temporary storage variable, we only needed this as a place to temporarily hold the results from each stage

With *method chaining*, just remember that Python works from left-to-right. The results of what happened on the left will be used with the next procedure on the right, and so on down the chain.
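
Because each method works on the result of the one before it, the order of the chain matters. Here is a small sketch with a made-up string showing two chains that use the same methods in a different order and end up with different results.

```python
greeting = "Hello World"

# lower() runs first, so replace() sees "hello world" and finds an "h" to swap
lower_then_replace = greeting.lower().replace("h", "j")

# replace() runs first on "Hello World", which has no lowercase "h", so nothing is swapped
replace_then_lower = greeting.replace("h", "j").lower()

print(lower_then_replace)  # jello world
print(replace_then_lower)  # hello world
```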

We've already seen the function equivalent of method chaining, *nested functions*:

In [None]:
print(len(word_list_msft))

Here we have a function inside of a function and we work from the inside out, just as you would with order of operations in math (remember ***P***-E-M-D-A-S?)

1. `word_list_msft` - our list of words
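2. `len()` - find the length of that list: 95 if `word_list_msft` still holds the filtered list of words with five or more letters (re-running the step-by-step cell above replaces it with the longer, unfiltered list)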
3. `print()` - print that number to the screen

### 2.4.2. NumPy Arrays
---

We've already seen that Python has a basic data structure for sequences of data in lists and tuples. Another, more advanced way of representing a sequence of data is in a NumPy array (technically, `numpy.ndarray` in Python).

The motivation here is that lists and tuples are fantastic for managing modest amounts of data; however, they were implemented for flexibility, not for speed. Accordingly, if you were to load millions of tweets into a list, Python would start slowing down on you. We're not going to get into the [technical reasons](https://towardsdatascience.com/how-fast-numpy-really-is-e9111df44347) why... but I will show you: 

In [None]:
import numpy as np

# Create a list of numbers from 0 through 999,999
list_of_numbers = list(range(0,1000000))

print("Sum of numbers from 0 through 999,999 stored in a list:")
%timeit -n 100 sum(list_of_numbers)

#Create the NumPy array from the list
array_of_numbers = np.array(list_of_numbers)

print("\nSum of numbers from 0 through 999,999 stored in a NumPy array:")
%timeit -n 100 array_of_numbers.sum()

If your results are similar to mine, you can see that summing a list of numbers from 0 through 999,999 was over 10x slower in a list than in a NumPy array. Now imagine doing things much more complicated than taking a sum of consecutive integers and with potentially a lot more data... that NumPy array is looking pretty good to me.

But there is a tradeoff. Again, I mentioned that lists are slower because they were created for flexibility. Well, one way in which NumPy arrays are a little less flexible is that the contents of the array must be homogeneous.

You cannot have a string in one cell, an integer in the next and a float in the following. If the array is going to be an integer array, all of the contents must be integers. Fortunately, for what we are doing that's not going to be a very limiting constraint, so we can take advantage of the faster data structure.
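
Here is a minimal sketch of what "homogeneous" looks like in practice. A list happily keeps an integer next to a float, but when you hand mixed numbers to `np.array`, NumPy quietly converts everything to one common type. The variable names are just for this example.

```python
import numpy as np

# A list keeps each item's own type
mixed_list = [1, 2, 3.5]
print([type(item).__name__ for item in mixed_list])  # ['int', 'int', 'float']

# An array of only integers gets an integer data type
int_array = np.array([1, 2, 3])
print(int_array.dtype)

# Mix in a single float and every element is stored as a float
mixed_array = np.array(mixed_list)
print(mixed_array)
print(mixed_array.dtype)  # float64
```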

### 2.4.3. Pandas DataFrames
---

The last data structure we'll discuss is the pandas DataFrame. The pandas package is one of the quantitative methods workhorses in Python data science. Put simply, whereas NumPy works with individual arrays of data, the `pandas.DataFrame` data structure assembles these arrays together into entire datasets much like those we use in Stata/SPSS/SAS/R/etc.

Here too, we can approximate a pandas DataFrame using the basic structures available to us. And we can even approximate them using multi-dimensional NumPy arrays. However, Pandas is optimized for working with data as a dataset and comes with a suite of methods for importing, wrangling with, analyzing, visualizing, and exporting data that makes data analysis much easier.

Now pandas is no panacea: there are packages that are even faster as your dataset gets bigger and bigger (see [Dask](https://dask.org/) and [Vaex](https://vaex.io/docs/index.html)). However, for this demonstration, we're going to be using the Python data science darling, pandas.

Let's play with pandas!

In [None]:
# pandas' conventional alias is pd
import pandas as pd

# Some data for our dataset
about_apple = {'ceo_name': 'Tim Cook', 'employees': 147000, 'revenue': 274515000000, 'ticker': 'AAPL', 'last_updated': 'December 31, 2020'}
about_ibm = {'ceo_name': "Arvind Krishna", 'employees': 345900, 'revenue': 73600000000, 'ticker': 'IBM', 'last_updated': 'December 31, 2020'}
about_microsoft = {'ceo_name': "Satya Nadella", 'employees': 96000, 'revenue': 143000000000, 'ticker': 'MSFT', 'last_updated': 'December 31, 2020'}
about_alphabet = {'ceo_name': "Sundar Pichai", 'employees': 135301, 'revenue': 182530000000, 'ticker': 'GOOGL', 'last_updated': 'December 31, 2020'}
about_amazon = {'ceo_name': "Jeff Bezos", 'employees': 1300000, 'revenue': 125560000000, 'ticker': 'AMZN', 'last_updated': 'December 31, 2020'}
dataset = [about_apple, about_ibm, about_microsoft, about_alphabet, about_amazon]

# Create the pandas DataFrame
tech_df = pd.DataFrame(dataset).set_index('ticker')

# Print out the first few rows of the DataFrame
tech_df.head()

Once we have the DataFrame loaded, we can view descriptive statistics by column:

In [None]:
tech_df.describe()

We can look at, create, and manipulate the contents of individual columns and rows:

In [None]:
# View just the revenue column
print(tech_df['revenue'])

In [None]:
# Just view the Microsoft row
print(tech_df.loc['MSFT'])

In [None]:
# Create new columns based on the value of the existing columns
tech_df['ln_employees'] = np.log(tech_df['employees'])
tech_df.head()

In [None]:
# Modify the contents of an entire column - in this case, z-score standardization
tech_df['revenue'] = (tech_df['revenue'] - tech_df['revenue'].mean()) / tech_df['revenue'].std()
tech_df.head()

Again, this doesn't even scratch the surface of what Pandas can do. But for our purposes, this level of familiarity is sufficient.
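
If the z-score standardization in the last cell went by quickly, here is a sketch of the same arithmetic in plain Python, using the revenue figures from the example dataset. It assumes the sample standard deviation (dividing by n - 1), which is what the pandas `.std()` method uses by default. The variable names are my own.

```python
from statistics import mean, stdev

# Revenue figures copied from the example dataset above
revenues = [274515000000, 73600000000, 143000000000, 182530000000, 125560000000]

average_revenue = mean(revenues)
revenue_spread = stdev(revenues)  # sample standard deviation (divides by n - 1)

# z-score: how many standard deviations each value sits from the mean
z_scores = [(value - average_revenue) / revenue_spread for value in revenues]
print([round(z, 2) for z in z_scores])
```

After standardizing, the values have a mean of (roughly) zero and a standard deviation of one, which is the whole point of the transformation.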

## 2.5. Practice Exercises
---

2.5.1. Create two variables: `corpus_size`, an integer with value 500; and `words_per_text`, a float with value 175.32.

Find the total number of words in the corpus and assign it to variable `total_words`. Have Python tell you whether it is an integer. If it's not, have Python convert it to one.


2.5.2. Create a string variable: `test_sentence` and give it the value "This is my first test sentence for CATA analysis."

Have Python:
*   Replace all periods in the sentence with exclamation marks. (Print out the new sentence to verify)
*   Change the entire string to lowercase (Print out the new sentence to verify)
*   Tell you how many times the two-letter-combination "is" appears in the sentence, and
*   Tell you how many characters long the sentence is.


2.5.3. Have Python tell you what the 'truthiness' of an empty list is. How about for a list with three elements in it?

2.5.4. Create two tuple variables containing ('apple', 'banana', 'cucumber') and ('dal', 'eggplant', 'fettuccine'). 

Combine them into a new 6-item long tuple.

Can you predict what is in index number 2? Have Python tell you what is there.

Have Python tell you the index number for the 'dal' entry.

2.5.5. Create a list of the numbers 1-10 in random order and store it in the variable `number_list`. 

Have Python sort it for you and print it out to confirm that it's correctly sorted. 

Once it is sorted, have Python print out every 2nd number from 3 to 9.

Delete the number 5 from the list and have Python print the updated list to show that it's no longer there.

2.5.6. Create a dictionary for yourself (make the variable name `about_me`). Make sure to include some relevant entries (age, profession, number of children, etc.) in the dictionary when you set it up, but don't include a 'zip_code' key. 

Once it's set up, have Python tell you whether you provided the 'zip_code' key or not (it should return True or False).

Try to access the 'zip_code' key anyway, but do so in a way that returns "-missing-" if it's not there

Add the 'zip_code' key to the dictionary and have Python now tell you what the 'zip_code' value is.

Print out the list of all entries in the dictionary

2.5.7. You have a theory that businesses evolve based on the golden ratio. Have Python print the `golden_ratio` variable from the `scipy.constants` module.

2.5.8. What will the contents of the following `mystery_structure` be?

```
from collections import Counter

grades = ['a', 'b', 'c', 'b', 'c', 'a', 'a', 'b', 'b', 'a', 'c']
new_grades = ['a', 'd', 'b', 'c', 'b', 'b', 'b', 'a', 'a']
grades.extend(new_grades)

mystery_structure = Counter(grades)
print(mystery_structure)
```

Test the code to see if you are correct.

2.5.9. What is the difference between:

1. A list
2. A NumPy array
3. A pandas DataFrame?

2.5.10. *Text analysis challenger problem*

You are given the following sentence:

`"Let's do some Computer-Aided Text Analyses. I can't wait to see the text analytic results!"`

Preprocess this text such that it becomes this list of words:

`['let', 'us', 'do', 'some', 'computer', 'aided', 'text', 'analyses', 'I', 'can', 'not', 'wait', 'to', 'see', 'the', 'text', 'analytic', 'results']`

Use a Counter to have Python tell you what the most commonly used words are.

Shoot to do this in 8 lines of code or less.

## 2.6. Hints for Practice Exercises
---

2.5.1. Create two variables: `corpus_size`, an integer with value 500; and `words_per_text`, a float with value 175.32.

Find the total number of words in the corpus and assign it to variable `total_words` and print it out. Have Python tell you whether it is an integer. If it's not, have Python convert it to one. Confirm that it is now an integer

Hints:

*   `=` is the assignment operator and `*` multiplies numbers
*   The `type()` function identifies the data type of the data you pass to it.
*   The `isinstance()` function confirms whether the data you pass is of a certain type
*   The `int()` function turns the data you pass to it into an integer



2.5.2. Create a string variable: `test_sentence` and give it the value "This is my first test sentence for CATA analysis."

Have Python:
*   Replace all periods in the sentence with exclamation marks. (Print out the new sentence to verify)
*   Change the entire string to lowercase (Print out the new sentence to verify)
*   Tell you how many times the two-letter-combination "is" appears in the sentence, and
*   Tell you how many characters long the sentence is.

Hints:
*   The `replace()` method replaces bits of a string with other bits, but it doesn't save the results unless you assign the results back to the variable
*   The `lower()` method changes everything to lowercase, but it doesn't save the results unless you assign the results back to the variable
*   The `count()` method counts the number of times a substring appears in a string
*   The `len()` *function* identifies the length of the data you pass to it.

2.5.3. Have Python tell you what the 'truthiness' of an empty list is. How about for a list with three elements in it?

Hints:

*   Recall that 'truthiness' refers to the boolean representation of a variable
*   The `bool()` function converts data to boolean.
*   Lists are enclosed in square brackets, so empty lists have nothing between the brackets
*   Items in a list are separated by commas

2.5.4. Create two tuple variables containing ('apple', 'banana', 'cucumber') and ('dal', 'eggplant', 'fettuccine'). Combine them into a new 6-item long tuple. Can you predict what is in index number 2? Have Python tell you what is there. Have Python tell you the index number for the 'dal' entry.

Hints:

*   Tuples are enclosed in parentheses and their items are separated by commas
*   Remember that computers count starting with zero
*   Use square brackets after the variable name to select just one item from a tuple based on its index
*   Use the `index()` method to find the index of a specific entry



2.5.5. Create a list of the numbers 1-10 in random order and store it in the variable `number_list`. Have Python sort it for you and print it out to confirm that it's correctly sorted. Once it is sorted, have Python print out every 2nd number from 3 **through** 9. Delete the number 5 from the list and have Python print the updated list to show that it's no longer there.

Hints:

* Lists are enclosed in square brackets and their items are separated by commas
* The `sort()` method sorts lists in-place (no need to assign to a new variable)
* Slicing uses `start:stop:step` notation to show subsets of a list
* The `remove()` method removes an item from a list based on the data
* The `pop()` method removes an item from the list based on the index

2.5.6. Create a dictionary for yourself (make the variable name `about_me`). Make sure to include some relevant entries (age, profession, number of children, etc.) in the dictionary when you set it up, but don't include a 'zip_code' key. Once it's set up, have Python tell you whether you provided the 'zip_code' key or not (it should return True or False). Try to access the 'zip_code' key anyway, but do so in a way that returns "-missing-" if it's not there. Add the 'zip_code' key to the dictionary and have Python now tell you what the 'zip_code' value is. Print out the list of all entries in the dictionary.

Hints:
* Dictionaries are enclosed in braces and their entries are separated by commas
* Each entry in a dictionary is of form: `key: value`
* The `in` and `not in` binary operators confirm what is/n't in a dictionary
* The `get()` method has an option to return a default value when a key is missing
* Use square brackets after the variable name to add/modify a dictionary's contents (or you can use the `update()` method)
* The `items()` method will return all of a dictionary's entries

2.5.7. You have a theory that businesses evolve based on the golden ratio. Have Python print the `golden_ratio` variable from the `scipy.constants` module.

Hints:

* There are several ways of importing a module:
  * `import _______`
  * `import _______ as ________`
  * `from _______ import _______`
* You're getting a variable, so no parentheses are needed after 'golden_ratio'

2.5.8. What will the contents of the following `mystery_structure` be?

```
from collections import Counter

grades = ['a', 'b', 'c', 'b', 'c', 'a', 'a', 'b', 'b', 'a', 'c']
new_grades = ['a', 'd', 'b', 'c', 'b', 'b', 'b', 'a', 'a']
grades.extend(new_grades)

mystery_structure = Counter(grades)
print(mystery_structure)
```

Hints:
* The `extend()` method merges the second list into the first list
* Counter objects are like dictionaries, but the values they hold are counts

2.5.9. What is the difference between:

1. A list
2. A NumPy array
3. A pandas DataFrame?

Hints:

* Are the contents more restrictive in some than others?
* Are some generally faster than others?
* Are some better structured for full, multidimensional datasets, with data ingestion/manipulation/analysis/exportation capabilities built in?

2.5.10. *Text analysis challenger problem*

You are given the following sentence: `"Let's do some Computer-Aided Text Analyses. I can't wait to see the text analytic results!"` 

Preprocess this text such that it becomes this list of words: `['let', 'us', 'do', 'some', 'computer', 'aided', 'text', 'analyses', 'I', 'can', 'not', 'wait', 'to', 'see', 'the', 'text', 'analytic', 'results']`

Use a Counter to have Python tell you what the most commonly used words are. Shoot to do this in 8 lines of code or less.

Hints:
* The `replace()` method can replace more than one letter/character at a time. Could this help with the contractions?
* The `split()` method turns strings into lists
* Method chaining and function nesting help combine lines of code that all work with the same data.

## 2.7. Solutions to Practice Exercises
---

In [None]:
# 2.5.1. Calculating total words.

corpus_size = 500
words_per_text = 175.32
total_words = corpus_size * words_per_text
print(f"total_words = {total_words}")

# Option 1
print(f"total_words is a(n): {type(total_words)}")
# Option 2
print(f"Is total_words an integer? {isinstance(total_words, int)}")

total_words = int(total_words)
print(f"Is total_words an integer now? {isinstance(total_words, int)}")

In [None]:
# 2.5.2. Exploring a test sentence.

test_sentence = "This is my first test sentence for CATA analysis."

test_sentence = test_sentence.replace(".", "!")
print(f"With exclamation marks: {test_sentence}")

test_sentence = test_sentence.lower()
print(f"All lowercase: {test_sentence}")

is_count = test_sentence.count("is")
print(f"\"is\" appears {is_count} times in the sentence")

sentence_length = len(test_sentence)
print(f"The sentence is {sentence_length} characters long")

In [None]:
# 2.5.3. The Truthiness of lists.

empty_list = []
three_item_list = [0, 1, 2]

print(f"The truthiness of an empty list is {bool(empty_list)}")
print(f"The truthiness of a three-item list is {bool(three_item_list)}")

In [None]:
# 2.5.4. Tuple manipulation

tuple_a = ('apple', 'banana', 'cucumber')
tuple_b = ('dal', 'eggplant', 'fettuccine')
combined_tuple = tuple_a + tuple_b

# Cucumber will be in index two
print(f"{combined_tuple[2]} is in index two.")
print(f"\"dal\" is in index {combined_tuple.index('dal')}")

In [None]:
#2.5.5. List manipulation

number_list = [5,3,7,1,10,9,2,6,8,4]
number_list.sort()
print(f"The sorted list is {number_list}")
print(f"Three through nine by twos: {number_list[2:9:2]}")

# option 1
number_list.pop(4)
# option 2
# number_list.remove(5)

print(f"Without number 5 listed: {number_list}")


In [None]:
#2.5.6. Dictionary manipulation

about_me = {"age": 37,
            "profession": "professor",
            "num_children": 1,
            "married": True}

print(f"Is \"zip_code\" a key in the dictionary? {'zip_code' in about_me}")
print(f"Aaron's zip code is: {about_me.get('zip_code', '-missing-')}")
about_me["zip_code"] = 47401
print(f"Aaron's zip code is: {about_me.get('zip_code', '-missing-')}")
print(f"The entries in the dictionary are: {about_me.items()}")

In [None]:
#2.5.7. Accessing modules

# option 1
import scipy.constants
print(f"The golden ratio is {scipy.constants.golden_ratio}")

# option 2
import scipy.constants as spc
print(f"The golden ratio is {spc.golden_ratio}")

# option 3
from scipy.constants import golden_ratio
print(f"The golden ratio is {golden_ratio}")

In [None]:
#2.5.8. Counter Objects

from collections import Counter
 
grades = ['a', 'b', 'c', 'b', 'c', 'a', 'a', 'b', 'b', 'a', 'c']
new_grades = ['a', 'd', 'b', 'c', 'b', 'b', 'b', 'a', 'a']
grades.extend(new_grades)
 
mystery_structure = Counter(grades)
print(mystery_structure)

2.5.9. lists vs arrays vs DataFrames.

Lists:
* Very flexible: can hold myriad different types of data at a time
* Comparatively slow owing to this flexibility

NumPy Arrays:
* Restrictive: can hold only one type of data in a single array
* Efficient/fast owing to this restrictiveness

pandas DataFrames:
* Better suited for multidimensional data (e.g., the datasets we work with in our statistical software where we have many pieces of data per record)
* Lots of helper functions for data ingestion/manipulation/analysis/visualization/exportation

2.5.10. *Text analysis challenger problem*

Save the following to a variable `original_text`: 

`"Let's do some Computer-Aided Text Analyses. I can't wait to see the text analytic results!"`

Preprocess this text and store the results into the variable `preprocessed_text`:

`['let', 'us', 'do', 'some', 'computer', 'aided', 'text', 'analyses', 'I', 'can', 'not', 'wait', 'to', 'see', 'the', 'text', 'analytic', 'results']`

Use a Counter to have Python tell you what the most commonly used words are.

In [None]:
#2.5.10. Text analysis challenger problem

# In 8 lines of code - pretty ideal, each major step is one line of code.
from collections import Counter
original_text = "Let's do some Computer-Aided Text Analyses. I can't wait to see the text analytic results!" # Original string
preprocessed_text = original_text.lower() # Standardize case (note: this also turns "I" into "i")
preprocessed_text = preprocessed_text.replace("'s", " us").replace("'t", " not") # Expand contractions
preprocessed_text = preprocessed_text.replace("!", "").replace(".", "").replace("-", " ") # Remove punctuation and split the hyphenated word
preprocessed_text = preprocessed_text.split(" ") # Split the string into a list of words
word_counter = Counter(preprocessed_text) # Create the Counter
print(word_counter) # Print the word counts


# In 4 lines of code - readability is getting a little iffy here, but it is still organized.
from collections import Counter
original_text = "Let's do some Computer-Aided Text Analyses. I can't wait to see the text analytic results!" # Original string
preprocessed_text = original_text.lower().replace("'s", " us").replace("'t", " not").replace("!", "").replace(".", "").replace("-", " ").split(" ") # Use method chaining to combine preprocessing steps
print(Counter(preprocessed_text)) # Use function nesting to combine the counter creation and printing


# In 2 lines of code - don't do this, I'm showing you what is possible... but too much of a good thing is a bad thing. This is just... ugly.
from collections import Counter
print(Counter("Let's do some Computer-Aided Text Analyses. I can't wait to see the text analytic results!".lower().replace("'s", " us").replace("'t", " not").replace("!", "").replace(".", "").replace("-", " ").split(" ")))