# Demo 02: Python and Working with Data

In this notebook we'll see some more Python and some more advanced functions for working with data! As we discussed in class, data is the cornerstone of most machine learning / artificial intelligence models, so we need to understand how to access and use that data.

In the last class we saw a few examples of how to store data in variables in Python. Let's quickly review.

First, if we only want to store a single value, we can create a variable to hold any type of data: a number, a word, or even a whole Python program!

In [None]:
# A quick thing I forgot: this is a comment. It's words in a code block that
# aren't code. We use the little # to set them off and they should be green.
# Things that aren't green are executable code!

new_variable = 12

print(new_variable)

Recall that making a variable creates a space in the computer's memory, like a box, where we can store things. When we put something new in that box, we lose what was there, unless we use the old value to build the new one, as in the second cell below.

In [None]:
new_variable = 22

print(new_variable)


In [None]:
new_variable = new_variable + 5

print(new_variable)

We also quickly saw something called a *list*, which is a type of variable that can hold other variables, in a fixed order. So if we want to keep track of a list of words where order matters, we can use a list.

In [None]:
my_words = ["this", "is", "a", "list", "of", "words"]

print(my_words)

We can add words to this list by updating it, the same as we did with the variables above.

In [None]:
my_words = my_words + ["new", "words"]

print(my_words)
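
As a small aside, lists also have a built-in .append method that adds one item to the end of a list. The sketch below uses a made-up list called more_words (not part of the demo) so it doesn't change my_words from above.

```python
# A separate example list (hypothetical name) so my_words above is unchanged
more_words = ["this", "is", "a", "list"]

# .append adds a single item to the end of the existing list
more_words.append("again")

# + joins two lists into a brand new list, which we store back in the variable
more_words = more_words + ["and", "again"]

print(more_words)
print(len(more_words))  # len tells us how many items are in the list
```

Both approaches end up with a longer list; .append is handy when we are adding one item at a time.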

Now we need one more thing that will help us out: a type of container called a dictionary. It lets us store lots of values with names and get them back whenever we want.

In [None]:
# Start a dict

new_dictionary = {"students": 10,
                  "others": 20}

In [None]:
print(new_dictionary)

In [None]:
print(new_dictionary["students"])

These are very cool because they let us look things up by name instead of by position (recall that we find things in a list by their order), and we can keep all kinds of things in there, not just numbers!

In [None]:
new_dictionary["a new one"] = "this is a string"
print(new_dictionary)
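
Two more dictionary tools are worth a quick look: checking whether a name is already in the dictionary, and asking for a value with a fallback if the name is missing. This is a minimal sketch with a made-up dictionary called class_info so that new_dictionary above stays as it is.

```python
# A small example dictionary (hypothetical name and values)
class_info = {"students": 10, "others": 20}

# "in" checks whether a name (a key) is already in the dictionary
print("students" in class_info)
print("teachers" in class_info)

# .get looks up a key and gives back a fallback value if the key is missing,
# instead of stopping with an error like class_info["teachers"] would
print(class_info.get("students", 0))
print(class_info.get("teachers", 0))
```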

So, let's pause for a second and think about how we'd build a DFA if we wanted to count all the words in a list..
.
.
.
.
.
.
.
.




...


...


...


Let's start by iterating over the list. We're going to use a special built-in function called enumerate that gives us both the position and the value of each item in a list. Note that positions start counting at 0. We'll come back to these details again later.

In [None]:
for number, word in enumerate(my_words):
  print("The " + str(number) + " word in the list is: " + word)
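
Since enumerate starts counting at 0, the first word shows up as "the 0 word". If we'd rather count like people do, enumerate accepts a start value. The sketch below uses a made-up list called sample_words and an f-string, which is just another way to build the same kind of message.

```python
sample_words = ["this", "is", "a", "list"]  # hypothetical example list

lines = []  # keep each message so we can look at them afterwards
for number, word in enumerate(sample_words, start=1):
    # an f-string fills in the values inside the curly braces
    line = f"Word number {number} in the list is: {word}"
    lines.append(line)
    print(line)
```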

So with the above we can start to think about how we might go over a list of words and keep track of how many times those words appear. Let's think about the DFA for this......


...

...



...



...



...


...

Yep, we need a new dictionary, and we can increment the values as we see each word in the "box", our list of words. The code below needs a little if statement. We'll come back to these, but trust me for now, this works!





In [None]:
all_words = ["this", "is", "a", "list", "of", "words", "words", "of", "list", "kiss"]
counting_dict = {} # empty dictionary

for word in all_words:
    if word in counting_dict:
        counting_dict[word] += 1
    else:
        counting_dict[word] = 1

In [None]:
print(counting_dict)
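
For comparison, Python's standard library has a ready-made tool for this exact job: Counter, from the collections module. This is a swapped-in shortcut, not the method used in this demo, and the name word_counts is made up for the example. It should give the same counts as the loop above.

```python
from collections import Counter

all_words = ["this", "is", "a", "list", "of", "words", "words", "of", "list", "kiss"]

# Counter walks over the list and counts each item for us
word_counts = Counter(all_words)

print(word_counts)
print(word_counts["words"])  # look up a count just like in a dictionary
```

Writing the loop by hand first is still worth it, because it shows what Counter is doing under the hood.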

# Working with Data from the Internet!

Now consider that we want to read a lot of books... really fast. Well, we don't want to type them in one at a time, that would take too long.

Luckily we learned about ASCII and the way that text is encoded like any other data. So we might want to read a classic, like Huck Finn, which is available on a website called [Project Gutenberg](https://www.gutenberg.org/ebooks/76).

There are a lot of different versions of that file. Some are for a Kindle, some are for other devices. At the end of the day they all use binary to represent the information, just in different ways! That's why we have file extensions in the first place: to tell us how the data in the file is organized.

We're interested in the most basic type of data: plain text with no fancy formatting or anything else, stored using an encoding called UTF-8, which we learned about a few weeks ago. As we can see, there is a [UTF-8 version of Huck Finn](https://www.gutenberg.org/cache/epub/76/pg76.txt).

So the first thing we need to do is download that data and store it all in a variable. For now we'll do as we did above and store it in one single variable.


In [None]:
# First, I just want to show you that for any value we can show it in binary,
# because this is all binary under the hood.

s = "this is a string"
' '.join("{:08b}".format(ord(c)) for c in s)

In [None]:
# or in Hex... if you go back to the lecture with the ASCII table, you'll see
# these are the same letters!

' '.join("{:02x}".format(ord(c)) for c in s)
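
To connect this to UTF-8, here is a minimal sketch of turning text into raw bytes and back again. The variable names encoded and accented are made up for the example. For plain English letters, the UTF-8 byte values match the ASCII table; characters outside ASCII take more than one byte.

```python
s = "this is a string"

# .encode turns text into the raw bytes that would be stored in a file
encoded = s.encode("utf-8")
print(list(encoded[:4]))  # the byte values for the first four letters

# .decode turns the bytes back into text
print(encoded.decode("utf-8") == s)

# A character outside ASCII (here an accented e) needs more than one byte
accented = "\u00e9"
print(len(accented), len(accented.encode("utf-8")))
```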

In [None]:
# Load up urllib and read all the data into one variable.

from urllib.request import urlopen
huck_finn_url = "https://www.gutenberg.org/cache/epub/76/pg76.txt"
# .read() gives us raw bytes; .decode("utf-8") turns those bytes into text
huck_finn_text = urlopen(huck_finn_url).read().decode("utf-8")
print(huck_finn_text)
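
One thing to notice in the output: the file includes Project Gutenberg's own header and license text around the book, so those words get counted too. The sketch below is one possible way to trim them. It assumes the file marks the book with lines beginning "*** START OF" and "*** END OF"; check your own file, because that is an assumption and not something this demo relies on. The function name strip_gutenberg_boilerplate and the sample text are made up for illustration.

```python
def strip_gutenberg_boilerplate(full_text):
    # Assumed markers; if either one is missing we return the text unchanged
    start_marker = "*** START OF"
    end_marker = "*** END OF"
    start = full_text.find(start_marker)
    end = full_text.find(end_marker)
    if start == -1 or end == -1:
        return full_text
    # Move to the end of the start marker's line so the marker itself is dropped
    start_of_book = full_text.find("\n", start)
    if start_of_book == -1:
        return full_text
    return full_text[start_of_book:end].strip()

# A tiny made-up stand-in for a downloaded file
sample_text = (
    "License notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "You don't know about me.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "More license notes"
)
book_only = strip_gutenberg_boilerplate(sample_text)
print(book_only)
```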

In [None]:
# That's not great, we need to break up all the words...

text = "here is a long string of text"
print(text)

In [None]:
# We can use the .split method to do this, which will break up words at spaces
# (and at other whitespace, like tabs and new lines)

print(text.split())

In [None]:
# Let's try it with our Huck Finn text...

huck_finn_words = huck_finn_text.split()
print(huck_finn_words)
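
Notice that .split only cuts at whitespace, so "The", "the" and "the," all count as different words. If we want them treated as the same word, one option is to lowercase each word and trim punctuation from its ends. This is a minimal sketch on a made-up sentence; the names sample_text, raw_words and clean_words are just for illustration.

```python
import string

sample_text = "The river. the River! THE river?"
raw_words = sample_text.split()

clean_words = []
for word in raw_words:
    # lowercase the word, then strip punctuation from both ends
    cleaned = word.lower().strip(string.punctuation)
    if cleaned:  # skip anything that was only punctuation
        clean_words.append(cleaned)

print(raw_words)
print(clean_words)
```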

Okay, so now we have a long, long list of words that we got from the internet. What can we do? Well, we can go grab some of the code from above that did that simple counting thing and we can reuse it! This is one of the coolest things about code: we can reuse, and reuse, and reuse, and build very powerful things out of very simple code...




In [None]:
# So let's count all the words in Huck Finn!

counting_dict = {} # empty dictionary

for word in huck_finn_words:
    if word in counting_dict:
        counting_dict[word] += 1
    else:
        counting_dict[word] = 1

In [None]:
print(counting_dict)
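
A dictionary this big is hard to read when printed all at once. One way to peek at just the most frequent words, using only plain Python, is to sort the (word, count) pairs by count. The sketch below uses a small made-up dictionary called sample_counts so it runs on its own.

```python
sample_counts = {"the": 5, "river": 2, "and": 4, "raft": 1}  # made-up counts

# .items() gives (word, count) pairs; we sort by the count, biggest first
sorted_counts = sorted(sample_counts.items(), key=lambda pair: pair[1], reverse=True)

top_three = sorted_counts[:3]  # keep only the first three pairs
for word, count in top_three:
    print(word, count)
```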

In [None]:
# So this is pretty good... but let's get a graph, because it's prettier..
import pandas as pd

df_words = pd.DataFrame(counting_dict.items(), columns=['Word', 'Count']).sort_values('Count', ascending=False)
df_words.head(10)

In [None]:
# It's a lot of words
df_words.describe()

In [None]:
# Let's just plot the 50 most frequent words.

df_words.sort_values('Count', ascending=False)[:50].plot.bar(x='Word', y='Count', figsize=(20,10))
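
To keep a copy of a graph as an image file, one option is to hold on to what .plot.bar gives back and ask its figure to save itself. This is a minimal sketch with made-up counts; the file name sample_word_counts.png and the variable names are placeholders, not something the demo requires.

```python
import pandas as pd

sample_counts = {"the": 5, "and": 4, "river": 2, "raft": 1}  # made-up counts
df_sample = pd.DataFrame(list(sample_counts.items()), columns=["Word", "Count"])

# .plot.bar returns the plot's axes; we keep it in a variable
ax = df_sample.plot.bar(x="Word", y="Count", figsize=(8, 4))

# The axes know which figure they belong to, and the figure can be saved
ax.get_figure().savefig("sample_word_counts.png", bbox_inches="tight")
```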

So that's it! We've downloaded a whole book, read it all, and plotted the frequency of the most common words in that book.

For your challenge: do this for any 3 books that you choose on Project Gutenberg. You have all the code and tools that you'll need above. You only need to generate a dictionary that contains the word count for all the books you choose **combined**.

If you submit a notebook with a graph for each book independently as well as all the books combined, then we'll award **3 bonus points**.
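
As a hint for the combined count, here is one possible way to add the counts from one dictionary into another, shown on two tiny made-up word lists instead of real books. The function names count_words and add_counts are hypothetical; this is a sketch of the idea, not the only way to do it.

```python
def count_words(words):
    # Same counting loop as earlier in the notebook, wrapped in a function
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

def add_counts(total, new_counts):
    # Add every count from new_counts into the running total
    for word, count in new_counts.items():
        total[word] = total.get(word, 0) + count
    return total

book_one_words = ["the", "river", "the", "raft"]  # stand-in for book one
book_two_words = ["the", "whale", "whale"]        # stand-in for book two

combined = {}
add_counts(combined, count_words(book_one_words))
add_counts(combined, count_words(book_two_words))
print(combined)
```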

Congratulations, we're well on our way to reading the entire internet and creating an LLM!

In [None]:
## Put all the code you need below here, you can add as many cells and steps as you want.

## The only rules are that you must save the graphs!
