In this project, we'll learn how computers store values in memory.

Our data set contains excerpts from CIA memos that detail covert activities. It includes the year the statement was made, then an excerpt from the memo.

![image.png](attachment:image.png)

The file consists of one long string. To use it effectively, we'd need to parse it and convert it into rows and columns. We've covered strings extensively so far, but we haven't covered how a computer stores them.

Computers store files on hard drives. A hard drive allows us to save data, turn the computer off, and then access the data again later. The tech community commonly refers to hard drives as magnetic storage, because they store data on magnetic strips.

Magnetic strips can only contain a series of two values - up and down. Our entire CSV file saves to a hard drive the same way. We can't directly write strings such as the letter a to a hard disk; we need to convert them to a series of magnetic ups and downs first.

We can do this with an encoding system called binary. With binary, the only valid numbers are 0 and 1. This constraint makes it easy to store binary values on a hard disk.

In other words, computers never store values like strings or integers directly. Everything is converted to binary first, and that is what makes storing data on devices like hard drives possible.

However, we normally count in "base 10." We call this system base 10 because there are 10 possible digits - 0 through 9. Binary is base two, because there are only two possible digits - 0 and 1.
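To make the idea of a base concrete, here is a minimal sketch of how positional notation works. The helper name `positional_value` is our own, not a built-in; it reads the digits from left to right and multiplies the running total by the base at each step.

```python
def positional_value(digits, base):
    """Interpret a string of digits in the given base (hypothetical helper)."""
    total = 0
    for digit in digits:
        # Shift the running total one place to the left, then add the new digit
        total = total * base + int(digit, base)
    return total

# "100" means 1*4 + 0*2 + 0*1 in base 2, but 1*100 + 0*10 + 0*1 in base 10
print(positional_value("100", 2))
print(positional_value("100", 10))
```

The same digits give different values depending on the base we read them in, which is why we always have to say which base a string of digits is written in.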

One way to work with binary in Python is to enter it as a string. If we enter something like b = 10 directly, Python will assume that it's a base 10 integer rather than binary. Putting quotes around the digits gives us a string that we can convert explicitly. (Python also accepts binary literals such as 0b10, but strings keep the individual digits visible.)

In [1]:
# Convert the binary number "100" to a base 10 integer

b = "100"

# Now, we can convert b from a binary string to an integer with the int function.
# We'll need to set the optional second argument, base, to 2 (binary is base two).

b_10 = int(b,2)
print(b_10)

4


In [2]:
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]

In [3]:
c = binary_add("1", "1")
c

'10'

In [4]:
# Add "10" (base 2) to c.
c = binary_add(c,"10")
c

'100'

In [5]:
a = 0 # base 10
b = "0" # base 2

# Loop 10 times
for i in range(0, 10):
    # Add 1 to each
    a += 1
    b = binary_add(b, "1")

    # Check if they are equal
    print(int(b, 2) == a)

True
True
True
True
True
True
True
True
True
True


The cool thing here is that a and b are always equal if we add the same amount to both.
This is because base 2 and base 10 are just ways to write numbers.
Counting 100 apples in base 2 or base 10 will always give us an equivalent result - we just have to convert between them.
We can represent any number in binary; we just need to use more digits than we would in base 10.
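As a quick illustration of that last point, the sketch below compares how many digits the same number needs in each base. It relies only on the built-in str() and format() functions; the helper name `digit_counts` is ours.

```python
def digit_counts(n):
    """Return (binary digits, base 10 length, base 2 length) for n."""
    base_2_digits = format(n, "b")  # binary digits without the "0b" prefix
    return base_2_digits, len(str(n)), len(base_2_digits)

for n in [9, 100, 1000]:
    print(n, digit_counts(n))
```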

Computers store strings in binary, just like they do with integers. First, they split them into single characters, then convert those characters to integers. Finally, they convert those integers to binary and store them.

We'll look at simple characters first - the so-called ASCII characters. These include all upper and lowercase English letters, digits, and several punctuation symbols.


We can use the ord() function to get the integer for an ASCII character. Then, we use the bin() function to convert that integer to binary.
The bin() function returns a string that begins with "0b" to indicate that it contains binary digits.
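For example, here is a short sketch of those two steps for the letter a. The variable names are ours, and the format() call is an optional extra that pads the result to a full eight bits.

```python
code = ord("a")               # integer for the character
binary = bin(code)            # string of binary digits with a "0b" prefix
padded = format(code, "08b")  # same digits, padded to eight bits

print(code)    # 97
print(binary)  # 0b1100001
print(padded)  # 01100001
```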

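ÿ has the integer value 255, which is the highest value we can represent with eight binary digits. Strictly speaking, ASCII itself only defines the values 0 through 127; ÿ comes from Latin-1, an 8-bit extension of ASCII whose 256 values Unicode keeps as its first 256 code points.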

In [6]:
base_10 = ord('ÿ')
base_10

255

In [7]:
binary = bin(base_10)
binary

'0b11111111'

Why is 255 the limit?

A single binary digit is called a bit, and computers store values in sequences of eight bits (i.e., a byte).
We might be more familiar with kilobytes or megabytes. A kilobyte is 1000 bytes, and a megabyte is 1000 kilobytes.
One byte can take on 256 different values (0 through 255), so a character set that uses one byte per character can contain at most 256 symbols. ASCII itself defines only the first 128 of them (0 through 127); 8-bit extensions such as Latin-1 fill in the rest.
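The link between eight bits and 256 possible values can be checked directly. This sketch uses Python's built-in bytes type, which only accepts integers that fit in one byte; the variable names are ours.

```python
values_per_byte = 2 ** 8  # two choices for each of the eight bits
print(values_per_byte)

largest = bytes([255])    # 255 is the largest value a single byte can hold
print(largest)

try:
    bytes([256])          # one more than a byte can hold
    fits_in_one_byte = True
except ValueError:
    fits_in_one_byte = False
print(fits_in_one_byte)
```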

In [14]:
# for i in range(0, 256):
#     print(chr(i))

We might be wondering what happened to all of the other characters and alphabets in the world. A one-byte character set can't handle them, because a single byte only has 256 possible values (and ASCII itself defines just 128 characters). The tech community realized it needed a new standard, and created Unicode.

Unicode assigns "code points" to characters. In Python, code points look like this:

"\u3232"

Code points are just integers, so we need an encoding system to convert them into bytes. The most common encoding system for Unicode is UTF-8. This encoding tells a computer which sequence of bytes represents each code point.

UTF-8 can encode values that are longer than one byte, which enables it to store all Unicode characters. It encodes characters using a variable number of bytes, which means that it also supports regular ASCII characters (which are one byte each).
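Here is a small sketch of that variable-length behavior. It assumes we use the str.encode() method to turn each character into UTF-8 bytes and then count them. The sample characters are our own choice, and the last one is built with chr() because its code point lies outside the 16-bit range.

```python
samples = ["a", "ÿ", "⟶", chr(0x1F600)]

# Map each character to the number of bytes UTF-8 needs for it
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    # Print the code point in hexadecimal rather than the character itself
    print(f"U+{ord(ch):04X}", n)
```

Characters with small code points stay at one byte, while larger code points need two, three, or four bytes.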

In [15]:
# We can initialize Unicode code points (the value for this code point is \u27F6).
code_point = "⟶"

# This particular code point maps to a right arrow character.
print(code_point)
print(ord(code_point))

⟶
10230


In [16]:
# As you can see, this takes up a lot more than 1 byte.
print(bin(ord(code_point)))

0b10011111110110


ASCII is a subset of Unicode. Unicode implements all of the ASCII characters, as well as the additional characters that code points allow.

This lets us create Unicode strings that combine both ASCII and Unicode characters.

By default, Python 3 uses Unicode for all strings, and UTF-8 is the default encoding for source code and for the .encode() and .decode() methods. That means we can enter either the Unicode code points or the actual characters.

The \u prefix means "the next four hexadecimal digits are a Unicode code point."
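To illustrate, the escape sequence and the integer value below produce the same one-character string. This is a sketch that reuses the right arrow code point from earlier; the variable names are ours.

```python
escaped = "\u27F6"      # four hexadecimal digits after \u
from_int = chr(0x27F6)  # the same code point built from its integer value

print(escaped == from_int)
print(len(escaped))
print(hex(ord(escaped)))
```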

Python includes a data type called "bytes." It's similar to a string, except that it contains encoded byte values. To create a bytes object from a string, we call the string's .encode() method and specify an encoding system (usually UTF-8).

In [17]:
# We can make a string with some Unicode values
superman = "Clark Kent␦"
print(type(superman))

<class 'str'>


In [18]:
# This tells Python to encode the string superman using the UTF-8 encoding system
# We end up with a sequence of bytes instead of a string
superman_bytes = superman.encode("utf-8")
superman_bytes

b'Clark Kent\xe2\x90\xa6'

In [19]:
print(type(superman_bytes))

<class 'bytes'>


superman_bytes prints out as b'Clark Kent\xe2\x90\xa6'. Similar to the \u prefix for a Unicode code point, \x is the prefix for a byte written as two hexadecimal digits.

Just like binary is base 2 and our normal counting system is base 10, hexadecimal is base 16. The valid digits in hexadecimal are 0-9 and A-F, where the letters stand for the values 10 through 15:

* A - 10
* B - 11
* C - 12
* D - 13
* E - 14
* F - 15

In hexadecimal, 9 + 1 equals A. We use hexadecimal because it represents a byte efficiently. A byte is eight bits, or eight binary digits. The highest value we can express in a byte is 11111111, or 255 in base 10. We can express the same value in two hexadecimal digits, FF.

Programmers often use hexadecimal to display bytes instead of binary because it's more compact and easier to write out.

Using .encode() converted a sequence of code points into something that looked like \xe2\x90\xa6. The three sections of this result (which the \ character separates) represent three hexadecimal bytes. The \x prefix means "the next two digits are in hexadecimal."
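One way to see those three bytes is to look at the integers inside the bytes object. The sketch below uses the same three bytes shown in the output above; iterating over a bytes object yields integers from 0 to 255.

```python
encoded = b"\xe2\x90\xa6"

as_integers = list(encoded)  # one integer per byte
as_hex = encoded.hex()       # the same bytes as a plain hexadecimal string

print(len(encoded))
print(as_integers)
print(as_hex)
```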

Two hexadecimal digits equal eight binary digits, because each digit can hold a higher value in hexadecimal (base 16). For instance, the single hexadecimal digit "F" is 15, which takes four digits, 1111, to write in binary. Because it's shorter to display, and four binary digits always equal one hexadecimal digit, programs often use hexadecimal to print out values. This is purely for convenience.
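The relationship between one hexadecimal digit and four binary digits can be sketched like this. The helper name `hex_to_bits` is hypothetical.

```python
def hex_to_bits(hex_digits):
    """Expand each hexadecimal digit into its four binary digits."""
    return " ".join(format(int(d, 16), "04b") for d in hex_digits)

print(hex_to_bits("f"))
print(hex_to_bits("e2"))
print(hex_to_bits("ff"))
```

Each group of four bits in the output corresponds to exactly one hexadecimal digit of the input.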

In [20]:
# F is the highest single digit in hexadecimal (base 16)
# Its value is 15 in base 10
print(int("F", 16))

# A in base 16 has the value 10 in base 10
print(int("A", 16))

15
10


In [21]:
def hexadecimal_add(a, b):
    return hex(int(a, 16) + int(b, 16))[2:]

In [22]:
# Add "2" to "ea" in hexadecimal, and assign the result to hex_ea.
hex_ea = hexadecimal_add("2", "ea")

# Add "e" to "f" in hexadecimal, and assign the result to hex_ef.
hex_ef = hexadecimal_add("e", "f")
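As a sanity check on these two sums, the sketch below repeats the helper so that it runs on its own, then converts each result back to base 10.

```python
def hexadecimal_add(a, b):
    return hex(int(a, 16) + int(b, 16))[2:]

hex_ea = hexadecimal_add("2", "ea")
hex_ef = hexadecimal_add("e", "f")

# In base 10, these sums are 2 + 234 = 236 and 14 + 15 = 29
print(hex_ea, int(hex_ea, 16))
print(hex_ef, int(hex_ef, 16))
```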


We can convert hexadecimal to binary fairly easily. We can even use the ord() and bin() functions that helped us convert code points to binary.

In [23]:
binary_aa = bin(ord("\xaa"))
binary_ab = bin(ord("\xab"))

There's no encoding system associated with the bytes data type. That means if we have an object with that data type, Python won't know how to display the (encoded) code points in it. For this reason, we can't mix bytes objects and strings together.

In [24]:
hulk_bytes = "Bruce Banner␦".encode("utf-8")

# We can't mix strings and bytes
# For instance, if we try to replace part of the bytes object using string arguments, it won't work, because the value has been encoded to bytes
try:
    hulk_bytes.replace("Banner", "")
except TypeError:
    print("TypeError with replacement")

TypeError with replacement


In [25]:
# We can create objects of the bytes data type by putting a b in front of the quotation marks in a string
hulk_bytes = b"Bruce Banner"
# Now, instead of mixing strings and bytes, we can use the replace method with bytes objects instead
hulk_bytes.replace(b"Banner", b"")

b'Bruce '

In [26]:
# Make a bytes object containing "Thor", and assign it to thor_bytes.
thor_bytes = b"Thor"
print(type(thor_bytes))

<class 'bytes'>


Once we have a bytes object, we can decode it into a string using an encoding system. We use the .decode() method to do this.
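Decoding only gives back the original text when we use the same encoding that produced the bytes. Here is a minimal sketch, using Latin-1 as an example of a mismatched encoding (our choice for illustration). The string is the same one used earlier, with its final symbol written as a \u escape.

```python
original = "Clark Kent\u2426"
encoded = original.encode("utf-8")

round_trip = encoded.decode("utf-8")
mismatched = encoded.decode("latin-1")  # every byte becomes its own character

print(round_trip == original)
print(mismatched == original)
print(len(original), len(round_trip), len(mismatched))
```

The mismatched result is longer because Latin-1 treats the three bytes of the final symbol as three separate characters.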

In [27]:
aquaman_bytes = b"Who knows?"
aquaman = aquaman_bytes.decode("utf-8")
print(aquaman)
print(type(aquaman))

Who knows?
<class 'str'>


In [28]:
import csv
# When we open a file, we can specify the system used to encode it (in this case, UTF-8).
# The with block closes the file once we've finished reading it.
with open("sentences_cia.csv", 'r', encoding="utf-8") as f:
    csvreader = csv.reader(f)
    sentences_cia = list(csvreader)

The data contains several symbols and other Unicode characters.

In [29]:
# The data consists of two columns
# The first column contains the year, and the second contains a sentence from a CIA report written in that year
# Print the first column of the second row
print(sentences_cia[1][0])

1997


Having a dataframe will make processing and analysis much simpler because we can use the .apply() method.

In [30]:
# The with block closes the file once we've finished reading it.
with open("legislators.csv", 'r', encoding="utf-8") as f:
    csvreader = csv.reader(f)
    legislators = list(csvreader)

In [31]:
import pandas as pd

legislators_df = pd.DataFrame(legislators)

In [32]:
legislators_df.head(3)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | last_name | first_name | birthday | gender | type | state | party |
| 1 | Bassett | Richard | 1745-04-02 | M | sen | DE | Anti-Administration |
| 2 | Bland | Theodorick | 1742-03-21 | | rep | VA | |


As we can see, the first row contains the headers, which we don't want (because they're not actually data). Instead, we can pass them in as the column names.

In [33]:
legislators_df = pd.DataFrame(legislators[1:], columns=legislators[0])

In [34]:
sentences_cia = pd.DataFrame(sentences_cia[1:], columns = sentences_cia[0])

Now that we've formatted our data nicely, we need to process the strings to count term occurrences.

First, though, we need to clean them up by removing extraneous symbols. We only really care about letters, digits, and spaces.

Luckily, we can check the integer code of each character using ord() to see if it's a character we want to keep.

In [35]:
# The integer codes for all the characters we want to keep
good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 
                   56, 57, 65, 66, 67, 68, 69, 70, 71,
                   72, 73, 74, 75, 76, 77, 78, 79, 80, 
                   81, 82, 83, 84, 85, 86, 87, 88, 89,
                   90, 97, 98, 99, 100, 101, 102, 103, 
                   104, 105, 106, 107, 108, 109, 110, 111,
                   112, 113, 114, 115, 116, 117, 118, 119,
                   120, 121, 122, 32]

sentence_15 = sentences_cia["statement"][14]
sentence_15

'"^^\'\'^ There was also CIA reporting in 1998 that KSM was "very close" to On June 12, 2001, it was reported that "Khaled" was actively recruiting people to travel outside Afghanistan, including to the United States where colleagues were reportedly already in the country to meet them, to carry out terrorist-related activities for UBL.'

In [36]:
# Iterate over the characters in the sentence, and only take those whose integer representations are in good_characters
# This will construct a list of single characters
cleaned_sentence_15_list = [s for s in sentence_15 if ord(s) in good_characters]

In [37]:
# Join the list together, separated by "" (no space), which creates a string again
cleaned_sentence_15 = "".join(cleaned_sentence_15_list)

In [38]:
cleaned_sentence_15

' There was also CIA reporting in 1998 that KSM was very close to On June 12 2001 it was reported that Khaled was actively recruiting people to travel outside Afghanistan including to the United States where colleagues were reportedly already in the country to meet them to carry out terroristrelated activities for UBL'
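The hand-typed list of integer codes used above can also be generated from the standard library's string module. This is a sketch of an alternative, not the notebook's own approach; the helper name `keep_good_characters` is hypothetical.

```python
import string

# Digits, uppercase letters, lowercase letters, and the space character
keep = string.digits + string.ascii_uppercase + string.ascii_lowercase + " "
good_characters = [ord(ch) for ch in keep]

def keep_good_characters(text):
    """Drop every character whose integer code is not in good_characters."""
    return "".join(ch for ch in text if ord(ch) in good_characters)

print(good_characters[:3], good_characters[-1])
print(keep_good_characters('"very close" to On June 12, 2001'))
```

Generating the list this way avoids typing 63 numbers by hand, which makes it harder to leave one out by accident.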

In [39]:
def clean_statement(row):
    # row is a single row of the dataframe, because we call apply with axis = 1
    good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 
                       65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
                       75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
                       85, 86, 87, 88, 89, 90, 97, 98, 99, 100,
                       101, 102, 103, 104, 105, 106, 107, 108, 109,
                       110, 111, 112, 113, 114, 115, 116, 117, 118,
                       119, 120, 121, 122, 32]
    statement = row["statement"]
    clean_statement_list = [s for s in statement if ord(s) in good_characters]
    return "".join(clean_statement_list)

sentences_cia["cleaned_statement"] = sentences_cia.apply(clean_statement, axis = 1)

In [40]:
# Now we need to combine the sentences and convert them to tokens.

# The eventual goal is to count up how many times each term occurs.

combined_statements = " ".join(sentences_cia["cleaned_statement"])


In [42]:
statement_tokens = combined_statements.split(" ")

We want to count how many times each term occurs in our data, so we can find the most common items.

The problem is that the most common words in the English language are ones that are relatively uninteresting to us right now -- words like **"the"**, **"a"**, and so on. These words are called stopwords - words that don't add much information to our analysis.

It's common to filter out any words on a list of known **stopwords**. What we'll do here for the sake of simplicity is filter out any words less than five characters long. This should remove most stopwords.

In [44]:
filtered_tokens = [s for s in statement_tokens if len(s) > 4]
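For comparison, here is a sketch of the list-based approach next to the length-based shortcut. The stopword set is a tiny hypothetical sample, not a standard list, and the tokens are made up for illustration.

```python
stopwords = {"the", "a", "was", "in", "that", "to"}  # hypothetical sample
tokens = ["There", "was", "also", "CIA", "reporting", "in", "1998"]

# Keep tokens that are not on the stopword list (ignoring case)
by_stopword_list = [t for t in tokens if t.lower() not in stopwords]

# Keep tokens that are at least five characters long
by_length = [t for t in tokens if len(t) > 4]

print(by_stopword_list)
print(by_length)
```

The length filter is simpler, but it also drops short words that may carry meaning, such as "CIA" in this sample.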

In [45]:
from collections import Counter
filtered_token_counts = Counter(filtered_tokens)

In [48]:
# We can use the most_common method of a Counter object to get the most common items
# We pass in a number, which is the number of items we want to get

common_tokens = filtered_token_counts.most_common(4)
common_tokens

[('interrogation', 391),
 ('information', 375),
 ('REDACTED', 375),
 ('Zubaydah', 328)]
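To see how Counter behaves on a small scale, here is a sketch with a handful of made-up tokens (they are not taken from the data set).

```python
from collections import Counter

tokens = ["report", "memo", "report", "agency", "memo", "report"]
counts = Counter(tokens)

print(counts["report"])        # how many times a single token occurs
print(counts.most_common(2))   # the two most frequent tokens with their counts
```

Each item that most_common returns is a (token, count) tuple, ordered from the most frequent token to the least.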

In [49]:
# Let's write a function that computes the most common terms by year.

def find_most_common_by_year(year, sentences_cia):
    data = sentences_cia[sentences_cia["year"] == year]
    combined_statement = " ".join(data["cleaned_statement"])
    statement_split = combined_statement.split(" ")
    counter = Counter([s for s in statement_split if len(s) > 4])
    return counter.most_common(2)

In [50]:
common_2000 = find_most_common_by_year("2000", sentences_cia)
common_2002 = find_most_common_by_year("2002", sentences_cia)
common_2013 = find_most_common_by_year("2013", sentences_cia)