# Memory and Unicode

## Introduction

### In this mission, we'll learn how computers store values in memory.

Our data set contains excerpts from CIA memos that detail covert activities. It includes the year the statement was made, then an excerpt from the memo. The file, `sentences_cia.csv`, is in CSV format. Here's a preview of the first few lines:

```
year,statement,,,
1997,"The FBI information included that al-Mairi's brother ""traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps.""",,,
1997,"The FBI information included that al-Mairi's brother ""traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps.""",,,
```

The file consists of one long string. To use it effectively, we'd need to parse it and convert it into rows and columns. We've covered strings extensively so far, but we haven't covered how a computer stores them.<br>

Computers store files on hard drives. A hard drive allows us to save data, turn the computer off, and then access the data again later. The tech community commonly refers to hard drives as magnetic storage, because they store data on magnetic strips.<br>

Magnetic strips can only contain a series of two values - up and down. Our entire CSV file saves to a hard drive the same way. We can't directly write strings such as the letter `a` to a hard disk; we need to convert them to a series of magnetic ups and downs first.<br>

We can do this with an encoding system called binary. With binary, the only valid numbers are `0` and `1`. This constraint makes it easy to store binary values on a hard disk.<br>

On the next few screens, we'll learn how to convert string values to binary values, as well as how to manipulate binary values.



## The Basics of Binary

Computers can't store values like strings or integers directly. Instead, they store information in binary, where the only valid numbers are 0 and 1. This system makes storing data on devices like hard drives possible.<br>

However, we normally count in "base 10." We call this system base 10 because there are 10 possible digits - 0 through 9. Binary is base two, because there are only two possible digits - `0` and `1`.<br>

One simple way to work with binary in Python is to enter it as a string. If we enter something like `b = 10` directly, Python will assume that it's a base 10 integer (rather than binary). Python does accept binary literals such as `0b10`, but it immediately stores them as ordinary integers, so we'll put quotes around the digits and work with strings of `0`s and `1`s instead.<br>

Let's explore how binary numbers work.

In [1]:
# Let's say b is a binary number. In this mission, we'll store binary numbers as strings.
# If we try to enter it directly as b = 10, Python will assume it's a base 10 integer.
b = "10"

# Now, we can convert b from a binary string to an integer
# with the int function.
# We'll need to set the optional second argument, base, to 2 (binary is base two).

print(int(b, 2))

2


In [4]:
base_10_100 = int("100", 2)
base_10_100

4
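To see why `int("100", 2)` returns 4, it can help to expand the positional notation by hand. The sketch below uses a hypothetical helper, `binary_to_int`, that isn't part of the original lesson. It simply doubles a running total for each new digit, which is one way (not the only way) to do the conversion.

```python
def binary_to_int(bits):
    """Convert a string of 0s and 1s to an integer (illustrative helper)."""
    total = 0
    for digit in bits:
        # Moving one place to the left doubles the running total,
        # then we add the new digit (0 or 1).
        total = total * 2 + int(digit)
    return total

print(binary_to_int("10"))    # 1*2 + 0
print(binary_to_int("100"))   # 1*4 + 0*2 + 0
print(binary_to_int("100") == int("100", 2))

# Python also accepts binary integer literals with a 0b prefix.
print(0b100)
```

In practice we'd keep using `int(b, 2)`; the loop is only here to show what the base argument is doing.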

## Binary Addition

We can add binary numbers together, just like we can with base 10 numbers.

In [11]:
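# In base 10, adding 1 to 99 rolls over to a new digit: 100.
a = 99
a += 1

# The same rollover happens in binary, only sooner.
b = "1"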

# We'll add binary values using a binary_add function that was made just for this exercise.
# It's not extremely important to know how it works right this second.
def binary_add(a, b):
    return bin(int(a, 2) + int(b, 2))[2:]

c = binary_add(b, "1")

In [12]:
# We now see that c equals "10", 
# which is exactly what happens in base 10 
# when we reach the highest possible digit.
print(c)

# c now equals "11"
c = binary_add(c, "1")
print(c)

# c now equals "100"
c = binary_add(c, "1")
print(c)

10
11
100


In [13]:
c = binary_add(c, "10")
print(c)

110
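The `binary_add` function above converts to integers and back, which hides the carrying. As a rough sketch of what happens column by column, here is a hypothetical `binary_add_by_hand` helper (our own name, not from the lesson) that adds two binary strings the way we'd do it on paper.

```python
def binary_add_by_hand(a, b):
    """Add two binary strings digit by digit (hypothetical helper)."""
    # Pad the shorter string with leading zeros so the columns line up.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    carry = 0
    digits = []
    # Work from the rightmost column to the leftmost.
    for x, y in zip(reversed(a), reversed(b)):
        column = int(x) + int(y) + carry
        digits.append(str(column % 2))  # digit that stays in this column
        carry = column // 2             # 1 if the column overflowed
    if carry:
        digits.append("1")
    return "".join(reversed(digits))

print(binary_add_by_hand("1", "1"))
print(binary_add_by_hand("11", "1"))
print(binary_add_by_hand("100", "10"))
```

These three calls should match the values of `c` printed in the cells above.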


## Converting Binary Values to Other Bases

We just saw how we can convert between bases with the `int()` function.<br>

Let's see which values in binary equal which values in base 10.

In [14]:
# Start both at 0
a = 0
b = "0"

# Loop 10 times
for i in range(0, 10):
    # Add 1 to each
    a += 1
    b = binary_add(b, "1")

    # Check if they are equal
    print(int(b, 2) == a)

True
True
True
True
True
True
True
True
True
True


* The cool thing here is that a and b are always equal if we add the same amount to both.
* This is because base 2 and base 10 are just ways to write numbers.
* Counting 100 apples in base 2 or base 10 will always give us an equivalent result - we just have to convert between them.

### We can represent any number in binary; we just need to use more digits than we would in base 10.

In [15]:
base_10_1001 = int("1001", 2)
base_10_1001

9
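Going the other way, from a base 10 integer to binary digits, can be done by repeatedly dividing by two and collecting the remainders. The `to_binary` helper below is a hypothetical name used only for this sketch; Python's built-in `bin()` does the same job.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string (hypothetical helper)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next digit, right to left
        n //= 2
    return "".join(reversed(digits))

# Compare our result with the built-in bin(), minus its "0b" prefix.
for value in (2, 4, 9, 100):
    print(value, to_binary(value), bin(value)[2:])
```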

## Converting Characters to Binary

Computers store strings in binary, just like they do with integers. 
* First, they split them into single characters, 
* then convert those characters to integers. 
* Finally, they convert those integers to binary and store them.

We'll look at simple characters first - the so-called `ASCII` characters. These include all uppercase and lowercase English letters, digits, and several punctuation symbols.

We can use the `ord()` function to get the integer for an ASCII character.

In [16]:
ord('a')

97

Then, we use the `bin()` function to convert to binary. 
* The `bin()` function adds "`0b`" to the beginning of the string it returns to indicate that `it contains binary values`.
  * This is the reason we slice with `[2:]` in the `binary_add` function!

In [17]:
bin(ord('a'))

'0b1100001'
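Putting the three steps above together (split into characters, convert to integers, convert to binary), we could sketch the whole conversion for a short string. `string_to_binary` is a hypothetical helper, and padding each value to eight digits assumes every character has an integer value below 256.

```python
def string_to_binary(text):
    """Return one eight-digit binary string per character (illustrative helper)."""
    # format(n, "08b") is like bin(n)[2:], but zero-padded to eight digits.
    return [format(ord(char), "08b") for char in text]

print(string_to_binary("cab"))
```

Using `format()` instead of slicing `bin()` is just a convenience here; it keeps every character the same width so the bytes are easy to compare.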

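`ÿ` has the integer value 255, which makes it the "last" character that fits in a single byte. Strictly speaking, standard ASCII only defines the values 0 through 127; the values 128 through 255 come from eight-bit extensions such as Latin-1, which are often loosely called "extended ASCII."
* 255 is the limit because it is the highest value we can represent with eight binary digits.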

In [18]:
ord('ÿ')

255

In [20]:
# As you can see, we get eight 1's, which shows that this is the highest possible eight-digit value.
print(bin(ord('ÿ')))
print(bin(ord('ÿ'))[2:])

0b11111111
11111111


### Why is this?  
Because `a single binary digit` is called `a bit`, and **computers store values in sequences of eight bits** (i.e., `a byte`).
* You might be more familiar with kilobytes or megabytes. A kilobyte is 1000 bytes, and a megabyte is 1000 kilobytes.
* A byte can hold `256` different values (`0` through `255`). Standard `ASCII` uses only the first 128 of them, so **the largest amount of storage any single ASCII character can take up is one byte**.

In [21]:
binary_w = bin(ord("w"))
binary_bracket = bin(ord("}"))

print(binary_w[2:], binary_bracket[2:])

1110111 1111101


## Introduction to Unicode

You might be wondering what happened to all of the other characters and alphabets in the world. **Standard `ASCII` can't handle them, because it defines only 128 characters**, and even its eight-bit extensions top out at 256.

### The tech community realized it needed a new standard, and created Unicode.

Unicode assigns `"code points"` to characters. In Python, code points look like this:

`"\u3232"`

We can use an encoding system to convert these code points to sequences of bytes. The most common encoding system for Unicode is `UTF-8`. This encoding tells a computer which byte sequence represents each code point.<br>

`UTF-8` can encode values that are longer than one byte, which enables it to store all Unicode characters. It encodes characters using a variable number of bytes (from one to four), which means that it also supports regular `ASCII` characters (which are **one byte** each).
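To make the "variable number of bytes" idea concrete, here is a small sketch that encodes four sample characters and counts the resulting bytes. The sample characters are our own picks for illustration, written as escapes so the code points are visible.

```python
# One sample character from each UTF-8 length class (illustrative choices).
samples = ["a", "\u00e9", "\u27f6", "\U0001F600"]

for char in samples:
    encoded = char.encode("utf-8")
    # Show the code point in hexadecimal, the byte count, and the raw bytes.
    print(f"U+{ord(char):04X}", len(encoded), encoded)

byte_lengths = [len(char.encode("utf-8")) for char in samples]
print(byte_lengths)
```

The plain letter takes one byte, while characters with larger code points take two, three, or four.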

In [22]:
# We can initialize Unicode code points 
# (the value for this code point is \u27F6, 
# but you see it as a character here 
# because the Dataquest system is automatically converting it).
code_point = "⟶"

# This particular code point maps to a right arrow character.
print(code_point)

# We can get the base 10 integer value of the code point with the ord function.
print(ord(code_point))

# As you can see, this needs more than eight binary digits, so it can't fit in 1 byte.
print(bin(ord(code_point)))

⟶
10230
0b10011111110110


In [25]:
binary_1019 = bin(ord("\u1019"))
binary_1019

'0b1000000011001'

## Strings with Unicode

**`ASCII` is a subset of Unicode**. Unicode implements all of the ASCII characters, as well as the additional characters that code points allow.<br>

This lets us create **Unicode strings that combine both ASCII and Unicode characters**.<br>

### By default, Python 3 uses `Unicode` for all strings, and uses `UTF-8` as the default encoding for source code and for `.encode()`.
That means we can enter the Unicode code points or the actual characters.

* Make a string that combines `Unicode` and `ASCII`, and assign it to `s3`.

In [27]:
s1 = "café"
# The \u prefix means "the next four digits are a Unicode code point"
# It doesn't change the value at all (the last character in the string below is \u00e9)
s2 = "caf\u00e9"

# These strings are the same, because code points are equal to their corresponding Unicode characters.
# \u00e9 and é are equivalent.
print(s1 == s2)

True


In [32]:
s3 = "caf"+"\u00e9"
print(s3)

café


## The `Bytes` Data Type

Python includes **a data type called "`bytes`".** It's similar to a string, except that **it contains encoded byte values**.<br>

To create a `bytes` object from a string, we call the string's `.encode()` method and specify an encoding system (usually `UTF-8`).<br>

The method returns the encoded bytes and leaves the original string unchanged.

* Encode `batman` in `UTF-8` and assign it to `batman_bytes`.

In [35]:
# We can make a string with some Unicode values
superman = "Clark Kent␦"
print(superman)

Clark Kent␦


In [36]:
# This tells Python to encode the string superman
# using the UTF-8 encoding system
# We end up with a sequence of bytes instead of a string

superman_bytes = superman.encode('utf-8')
superman_bytes

b'Clark Kent\xe2\x90\xa6'

In [37]:
batman = "Bruce Wayne␦"
batman_bytes = batman.encode('utf-8')

print(batman_bytes)

b'Bruce Wayne\xe2\x90\xa6'
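A `bytes` object behaves like a sequence of small integers, which we can check with a short sketch. We write the final character as the escape `\u2426` here (an assumption about which character `␦` is, consistent with the `\xe2\x90\xa6` bytes shown above).

```python
batman = "Bruce Wayne\u2426"
batman_bytes = batman.encode("utf-8")

print(len(batman))         # number of characters in the string
print(len(batman_bytes))   # number of bytes after encoding

# Indexing a bytes object returns an integer, not a one-character string.
print(batman_bytes[0])

# The last three bytes together encode the single character U+2426.
print(list(batman_bytes[-3:]))
```

The string has fewer characters than the encoded version has bytes, because the final character needs three bytes in UTF-8.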


## Introduction to Hexadecimal

`batman_bytes` from the last screen prints out as `b'Bruce Wayne\xe2\x90\xa6'`. 
* Similar to the `\u` prefix for a **Unicode code point**, 
* `\x` is the prefix for a **byte written as two hexadecimal digits**.

Just like binary is base `2` and our normal counting system is base `10`, hexadecimal is base `16`. The valid digits in hexadecimal are `0-9` and `A-F`. Here are the values corresponding to each character:

* `A` - 10
* `B` - 11
* `C` - 12
* `D` - 13
* `E` - 14
* `F` - 15

In hexadecimal, $9 + 1$ equals `A`. We use hexadecimal because it represents a byte efficiently. You may recall that **a byte is eight bits**, or **eight binary digits**. 
* The highest value we can express in a byte is `11111111`, or `255` in base 10. 
* We can express the same value in `two hexadecimal digits`, `FF`.

Programmers often use hexadecimal to display bytes instead of binary because **it's more compact and easier to write out**.

## Hexadecimal Conversions

On the last screen, you might have noticed that using `.encode()` converted a sequence of code points into something that looked like `\xe2\x90\xa6`.<br>

The three sections of this result (which the `\` character separates) represent three hexadecimal bytes. The `\x` prefix means "the next two digits are in hexadecimal."<br>

**Two hexadecimal digits equal eight binary digits**, because each hexadecimal digit can take 16 different values, exactly as many as four binary digits. 
* For instance, `15` is the single digit `"F"` in hexadecimal, 
* but it takes four digits, `1111`, in binary.<br>

Because it's shorter to display, and four binary digits always equal one hexadecimal digit, programs often use hexadecimal to print out values. This is purely for convenience.<br>

Let's experiment a bit with hexadecimal conversions.


In [38]:
# F is the highest single digit in hexadecimal (base 16)
# Its value is 15 in base 10
print(int("F", 16))

# A in base 16 has the value 10 in base 10
print(int("A", 16))

# Just like the earlier binary_add function, this adds two hexadecimal numbers
def hexadecimal_add(a, b):
    return hex(int(a, 16) + int(b, 16))[2:]

# When we add 1 to 9 in hexadecimal, it becomes "a"
value = "9"
value = hexadecimal_add(value, "1")
print(value)

15
10
a


In [39]:
hex_ea = hexadecimal_add("ea", "2")
hex_ef = hexadecimal_add("f", "e")

print(hex_ea, hex_ef)

ec 1d


## Hex to Binary

We can convert hexadecimal to binary fairly easily. We can even use the `ord()` and `bin()` functions that helped us convert code points to binary.


In [None]:
# One byte (eight bits) in hexadecimal (the value of the byte below is \xe2)
hex_byte = "â"

In [40]:
# Print the base 10 integer value for the hexadecimal byte
print(ord(hex_byte))

# This gives the exact same value. Remember that \x is just a prefix, and doesn't affect the value.
print(int("e2", 16))

# Convert the base 10 integer to binary
print(bin(ord("â")))

226
226
0b11100010


* Convert the hexadecimal byte `"\xaa"` to binary, and assign the result to `binary_aa`.
* Convert the hexadecimal byte `"\xab"` to binary, and assign the result to `binary_ab`.

In [41]:
binary_aa = bin(ord("\xaa"))
binary_ab = bin(ord("\xab"))

print(binary_aa, binary_ab)

0b10101010 0b10101011
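Because each hexadecimal digit maps to exactly four binary digits, we can also convert digit by digit without going through `ord()`. The `hex_to_binary` helper below is a hypothetical name for this sketch, and it assumes its input contains only valid hexadecimal digits.

```python
def hex_to_binary(hex_string):
    """Convert hex digits to binary, four bits per digit (illustrative helper)."""
    # int(digit, 16) gives the digit's value; "04b" pads it to four bits.
    return "".join(format(int(digit, 16), "04b") for digit in hex_string)

print(hex_to_binary("e2"))
print(hex_to_binary("aa"), hex_to_binary("ab"))

# The digit-by-digit result matches converting the whole byte at once.
print(hex_to_binary("e2") == format(0xE2, "08b"))
```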


## Bytes and Strings

There's **no encoding system associated with the `bytes` data type**. That means if we have an object with that data type, Python won't know how to display the (encoded) code points in it. For this reason, we can't mix `bytes` objects and strings together.

In [42]:
hulk_bytes = "Bruce Banner␦".encode("utf-8")

# We can't mix strings and bytes
# For instance, if we try to call .replace() with string arguments, it won't work, because hulk_bytes has been encoded to bytes
try:
    hulk_bytes.replace("Banner", "")
except TypeError:
    print("TypeError with replacement")

TypeError with replacement


In [43]:
# We can create objects of the bytes data type 
# by putting a 'b' in front of the quotation marks in a string
hulk_bytes = b"Bruce Banner"

# Now, instead of mixing strings and bytes, 
# we can use the replace method with bytes objects instead
hulk_bytes.replace(b"Banner", b"")

b'Bruce '

## Decode Bytes to Strings

Once we have **a bytes object**, we can **decode it into a string** using an encoding system. We use the `.decode()` method to do this.


In [44]:
# Make a bytes object with aquaman's secret identity
aquaman_bytes = b"Who knows?"

# Now, we can use the decode method, along with the encoding system (UTF-8) to turn it into a string
aquaman = aquaman_bytes.decode("utf-8")

# We can print the value and type to verify that it's a string
print(aquaman)
print(type(aquaman))

Who knows?
<class 'str'>


In [47]:
morgan_freeman_bytes = b"Morgan Freeman"
morgan_freeman = morgan_freeman_bytes.decode('utf-8')

print(morgan_freeman, type(morgan_freeman))

Morgan Freeman <class 'str'>
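Decoding only works when we use an encoding that matches the bytes. As a rough sketch, the snippet below decodes the same bytes with `UTF-8` and then with `ASCII`, which can't interpret the final three bytes. The variable names are ours; `errors="replace"` is a standard option of `.decode()`.

```python
raw = b"Clark Kent\xe2\x90\xa6"

# UTF-8 understands the three-byte sequence at the end.
print(raw.decode("utf-8"))

# ASCII cannot interpret bytes above 0x7f, so decoding fails.
try:
    raw.decode("ascii")
except UnicodeDecodeError as error:
    print("Could not decode as ASCII:", error.reason)

# errors="replace" swaps each undecodable byte for the U+FFFD marker.
lenient = raw.decode("ascii", errors="replace")
print(lenient)
```

Replacing bad bytes keeps the program running, but it loses information, so it's usually better to find out which encoding the data really uses.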


## Read in File Data

Now that we understand Unicode, we can go ahead and read our data in.<br>

The data contains several symbols and other Unicode characters. We'll learn how to address them as we go along.

In [48]:
# We can read our data in using the csv module
import csv

In [49]:
# When we open a file, we can specify the system used to encode it (in this case, UTF-8).
# The with statement closes the file for us once we've read it.
with open("data/sentences_cia.csv", 'r', encoding="utf-8") as f:
    csvreader = csv.reader(f)
    sentences_cia = list(csvreader)

# The data consists of two columns
# The first column contains the year, 
# and the second contains a sentence from a CIA report written in that year

# Print the first column of the second row
print(sentences_cia[1][0])

# Print the second column of the second row
print(sentences_cia[1][1])

1997
The FBI information included that al-Mairi's brother "traveled to Afghanistan in 1997-1998 to train in Bin - Ladencamps."


In [54]:
sentences_ten = sentences_cia[9][1]
sentences_ten

'\'^^^ Prior to Abu Zubaydah\'s capture, the CIA considered Hassan Ghul a "First Priority Raid Target," based on reporting that: 97470(281317ZMAR02)("InNovember1998, [Muhammad] Atta, [Ramzi] Binalshibh, and [Said] Bahaji moved into the 54 Marienstrasse apartment in Hamburg that became the hub of the Hamburg cell.").'
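The statements contain commas and doubled quotation marks, which is why we let the `csv` module do the parsing instead of splitting on commas ourselves. Here is a minimal sketch using a made-up two-line sample (not the real data) held in memory with `io.StringIO`.

```python
import csv
import io

# A made-up sample in the same shape as the file: a header plus one row.
sample = 'year,statement,,,\n1997,"He said ""hello"", then left.",,,\n'

rows = list(csv.reader(io.StringIO(sample)))

print(rows[0])        # the header row, including three empty column names
print(rows[1][1])     # doubled quotes become single quotes; the comma survives
print(len(rows[1]))   # the trailing commas produce empty columns
```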

## Convert to a Dataframe

To make this easier for ourselves, let's convert our sentences to a pandas dataframe.<br>

Having a dataframe will make processing and analysis much simpler because we can use the `.apply()` method.
* Convert `sentences_cia` to a dataframe. Remember to handle the headers properly by mirroring the technique demonstrated in the example code below.

In [56]:
import csv

# Let's read in the legislators data from a few missions ago
# The with statement closes the file for us once we've read it.
with open("data/legislators.csv", 'r', encoding="utf-8") as f:
    csvreader = csv.reader(f)
    legislators = list(csvreader)

In [57]:
# Now, we can import pandas and use the DataFrame class to convert the list of lists to a dataframe.
import pandas as pd

legislators_df = pd.DataFrame(legislators)

# As you can see, the first row contains the headers, which we don't want (because they're not actually data)
print(legislators_df.iloc[0,:])

# To remove the headers, we'll subset the df and pass them in separately
# This code removes the headers from legislators, and instead passes them into the columns argument
# The columns argument specifies column names
legislators_df = pd.DataFrame(legislators[1:], columns=legislators[0])

# We now have the right data in the first row, as well as the proper headers
print(legislators_df.iloc[0,:])

# The sentences_cia data from the last screen is available.

0     last_name
1    first_name
2      birthday
3        gender
4          type
5         state
6         party
Name: 0, dtype: object
last_name                 Bassett
first_name                Richard
birthday               1745-04-02
gender                          M
type                          sen
state                          DE
party         Anti-Administration
Name: 0, dtype: object



In [60]:
sentences_cia_df = pd.DataFrame(sentences_cia[1:], columns = sentences_cia[0])
sentences_cia_df.head()

|   | year | statement |   |   |   |
|---|------|-----------|---|---|---|
| 0 | 1997 | The FBI information included that al-Mairi's b... |   |   |   |
| 1 | 1997 | The FBI information included that al-Mairi's b... |   |   |   |
| 2 | 1997 | For example, on October 12, 2004, another CIA ... |   |   |   |
| 3 | 1997 | On October 16, 2001, an email from a CTC offic... |   |   |   |
| 4 | 1997 | For example, on October 12, 2004, another CIA ... |   |   |   |


## Clean up Sentences

Now that we've formatted our data nicely, we need to process the strings to count term occurrences.<br>

First, though, we need to clean them up by **removing extraneous symbols**. We only really care about letters, digits, and spaces.<br>

Luckily, **we can check the integer code of each character using `ord()` to see if it's a character we want to keep**.

In [62]:
# The integer codes for all the characters we want to keep:
# digits 0-9 (48-57), uppercase A-Z (65-90), lowercase a-z (97-122), and space (32)
good_characters = [48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
                   65, 66, 67, 68, 69, 70, 71, 72, 73, 
                   74, 75, 76, 77, 78, 79, 80, 81, 82, 
                   83, 84, 85, 86, 87, 88, 89, 90, 97, 
                   98, 99, 100, 101, 102, 103, 104, 105, 
                   106, 107, 108, 109, 110, 111, 112, 113, 
                   114, 115, 116, 117, 118, 119, 120, 121, 
                   122, 32]

In [None]:
sentence_15 = sentences_cia_df["statement"][14]

# Iterate over the characters in the sentence, and only take those whose integer representations are in good_characters
# This will construct a list of single characters
cleaned_sentence_15_list = [s for s in sentence_15 if ord(s) in good_characters]

# Join the list together, separated by "" (no space), which creates a string again
cleaned_sentence_15 = "".join(cleaned_sentence_15_list)

* Make a function that takes a dataframe row and then returns the clean version of the `"statement"` column.
* Use the `.apply()` method on the dataframe to apply the function to each row of `sentences_cia_df`.
* Assign the resulting vector to the `cleaned_statement` column of `sentences_cia_df`.

In [63]:
def clean_string(row):
    clean_list = [c for c in row['statement'] if ord(c) in good_characters]
    return "".join(clean_list)

In [66]:
sentences_cia_df['cleaned_statement'] = sentences_cia_df.apply(clean_string, axis=1)
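Typing out integer codes by hand is easy to get wrong, so here is an alternative sketch that builds the allowed characters from the standard `string` module. The names `allowed_characters` and `clean_text` are hypothetical; the set covers `A-Z`, `a-z`, `0-9`, and the space character.

```python
import string

# Letters, digits, and the space character, collected in a set for fast lookups.
allowed_characters = set(string.ascii_letters + string.digits + " ")

def clean_text(text):
    """Keep only letters, digits, and spaces (illustrative helper)."""
    return "".join(char for char in text if char in allowed_characters)

print(clean_text('He said: "wait" (1998-1999).'))
```

Checking membership in a set is also faster than searching a list, which could matter once we clean every statement in the data.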

## Tokenize Statements

Now we need to combine the sentences and convert them to tokens. The eventual goal is to count up how many times each term occurs.

* Tokenize `combined_statements` by splitting it into words on the spaces.
  * You should end up with a list of all the words in `combined_statements`.
  * Assign the result to `statement_tokens`.

In [74]:
# We can use the .join() method on strings to join lists together.
# The string we use the method on will become the separator -- the character(s) between each string when they are joined.
combined_statements = " ".join(sentences_cia_df["cleaned_statement"])

In [75]:
statement_tokens = combined_statements.split(" ")
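One detail worth knowing: splitting on a single space produces empty strings wherever two spaces sit next to each other, which can happen after we strip out punctuation. A small sketch with a made-up string shows the difference between `.split(" ")` and `.split()` with no argument.

```python
text = "CIA  records 1998"   # note the two spaces after "CIA"

# Splitting on exactly one space leaves an empty token between the two spaces.
print(text.split(" "))

# Splitting with no argument treats any run of whitespace as one separator.
print(text.split())
```

In this mission the empty tokens don't change the final counts, since the next step drops anything shorter than five characters.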

## Filter the Tokens

We want to count how many times each term occurs in our data, so we can find the most common items.<br>

The problem is that the most common words in the English language are ones that are relatively uninteresting to us right now -- words like `"the"`, `"a"`, and so on. These words are called stopwords - words that don't add much information to our analysis.<br>

It's common to filter out any words on a list of known stopwords. What we'll do here for the sake of simplicity is filter out any words less than five characters long. This should remove most stopwords.

* Filter the `statement_tokens` list so that it only contains tokens that are at least five characters long.
  * Assign the result to `filtered_tokens`.

In [76]:
filtered_tokens = [token for token in statement_tokens if len(token) >= 5]
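For comparison, here is a rough sketch of the stopword-list approach mentioned above. The `stopwords` set is a tiny made-up sample rather than a real stopword list, and the tokens are invented for the example.

```python
# A tiny, hypothetical stopword list; real lists contain many more words.
stopwords = {"the", "a", "an", "and", "of", "to", "in", "that", "was"}

tokens = ["The", "detainee", "was", "held", "in", "a", "cell"]

# Compare in lowercase so "The" and "the" are treated the same.
kept_by_stopwords = [t for t in tokens if t.lower() not in stopwords]

# The length-based shortcut used in this mission.
kept_by_length = [t for t in tokens if len(t) >= 5]

print(kept_by_stopwords)
print(kept_by_length)
```

The length filter is simpler, but it also throws away short words that might matter, such as "held" and "cell" here.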

## Count the Tokens

Now that we've filtered the tokens, we can count how many times each one occurs. The `Counter` object from the `collections` library will help us with this.<br>

`Counter` takes a list (or any other iterable) as input. It creates a dictionary where the keys are list items, and the values are the number of times those items appear in the list.

* Count the items in `filtered_tokens` and assign the result to `filtered_token_counts`.

In [77]:
from collections import Counter
fruits = ["apple", "apple", "banana", "orange", "pear", "orange", "apple", "grape"]
fruit_count = Counter(fruits)

# Our code has counted each of the items in the list, and given them dictionary keys
print(fruit_count)

Counter({'apple': 3, 'orange': 2, 'banana': 1, 'pear': 1, 'grape': 1})


In [78]:
filtered_token_counts = Counter(filtered_tokens)

## Most Common Tokens

Now that we've created a `Counter` object using `filtered_tokens` as input, let's find the most common tokens.

Get the three most common items in `filtered_token_counts`, and assign the result to `common_tokens`.

In [84]:
common_tokens = filtered_token_counts.most_common(3)
common_tokens

[('interrogation', 391), ('information', 375), ('REDACTED', 375)]

## Finding the Most Common Tokens by Year

Let's write a function that computes the most common terms by year.

* Write a function that finds the two most common terms in `sentences_cia_df` for a given year (the `"year"` column).
  * The `"year"` column in `sentences_cia_df` stores strings, so you'll need to pass strings into the function.
  * Select the rows in `sentences_cia_df` that match that year, combine the clean statements, and split them into a list on the space character (`" "`). Then filter out words shorter than five characters, make a `Counter` object with the results, and find the two most common items in the counter.

* Use the function to find the most common terms for `"2000"`. Assign the result to `common_2000`.
* Use the function to find the most common terms for `"2002"`. Assign the result to `common_2002`.
* Use the function to find the most common terms for `"2013"`. Assign the result to `common_2013`.

In [89]:
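def find_top2_common_term(df, year):
    """Return the two most common long words in the statements for a year."""
    # Select the rows for the given year and clean each statement
    cleaned_ = df[df['year']==year].apply(clean_string, axis=1)
    # Combine the statements, then split them into tokens on spaces
    combined_cleaned_ = " ".join(cleaned_)
    tokens = combined_cleaned_.split(" ")
    # Keep only tokens that are at least five characters long
    tokens_filtered = [tk for tk in tokens if len(tk) >= 5]

    return Counter(tokens_filtered).most_common(2)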

In [90]:
common_2000 = find_top2_common_term(sentences_cia_df, "2000")
common_2000

[('terrorist', 9), ('Ahmad', 9)]

In [91]:
common_2002 = find_top2_common_term(sentences_cia_df, "2002")
common_2002

[('interrogation', 275), ('Zubaydah', 252)]

In [92]:
common_2013 = find_top2_common_term(sentences_cia_df, "2013")
common_2013

[('Response', 196), ('states', 111)]
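Calling the function once per year rescans the data each time. If we wanted counts for every year at once, one possible approach is to keep a separate `Counter` per year and fill them all in a single pass. The sketch below uses a few made-up `(year, statement)` pairs instead of the real dataframe, and `counts_by_year` is our own name.

```python
from collections import Counter, defaultdict

# Made-up (year, cleaned statement) pairs standing in for the real data.
rows = [
    ("2000", "terrorist network report"),
    ("2000", "terrorist financing"),
    ("2002", "interrogation report"),
]

# Each new year automatically gets its own empty Counter.
counts_by_year = defaultdict(Counter)

for year, statement in rows:
    # Keep only tokens that are at least five characters long.
    tokens = [t for t in statement.split() if len(t) >= 5]
    counts_by_year[year].update(tokens)

print(counts_by_year["2000"].most_common(1))
print(sorted(counts_by_year))
```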