In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab01.ipynb")

# Lab 1: Data Types and Structures in Python

Welcome to DATA 271: Data Wrangling and Visualization! This is your first weekly lab assignment of the semester. 

This document contains examples and small tasks ("appetizers") to help you make sure you understand the examples.  The culminating task ("main course") at the end of the document is more complex and uses most of the topics you will have worked through. You should rarely remain stuck for more than 10-15 minutes on questions in labs, so please ask for help. Collaborating on labs is encouraged! Explaining concepts is beneficial for learning -- one of the best ways to solidify your knowledge of a subject is to teach it. Please don't just share answers, though. 

For this lab and all future ones, please be sure to not re-assign variables throughout the notebook. For example, if you use `my_list` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you passed previously!

### In today's lab, we will
- Become familiar with Python's data types and structures.
- Become familiar with common functions and methods for these data structures.
- Understand indexing and slicing.
- Write short programs to solve small problems using these data structures, functions and methods.

## Overview

A *data structure* is an abstract description of a way of organizing data to allow certain operations on it to be performed efficiently. Theoreticians describe data structures and prove their properties in order to show that certain problems can be solved efficiently under certain assumptions.

A *data type* is a class of objects that all share some property. For example, "integer" is a data type containing all of the infinitely many integers, "string" is a data type containing all of the infinitely many strings, and "32-bit integer" is a data type containing all integers expressible in thirty-two bits. 

A nice analogy from Stack Overflow is that a data type is like an atom, while data structures are like molecules, meaning data types can't be further reduced, whereas a data structure may consist of multiple fields of different data.

The basic Python data structures include lists, sets, tuples, and dictionaries. Each data structure is unique in its own way, and we will investigate these properties in this activity. Data structures are “containers” that organize and group data according to type.

The data structures differ based on mutability and order. Mutability refers to the ability to change an object after its creation. Mutable objects can be modified after they’ve been created, while immutable objects cannot be modified after their creation. Order, in this context, relates to whether the position of an element can be used to access the element.

Lists, sets, and dictionaries are **mutable**. They *can* be modified. 
Strings and tuples are **immutable**. They *cannot* be modified. 
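
The difference can be seen in a short sketch. The variable names here (`colors`, `point`, `word`) are made up for illustration, and the `try`/`except` block is only a way to catch the error so the example runs to the end; it is not something you need for this lab.

```python
# Lists are mutable: item assignment changes the list in place.
colors = ['red', 'green', 'blue']
colors[0] = 'purple'

# Tuples are immutable: item assignment raises a TypeError.
point = (3, 4)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Strings are immutable too; methods such as replace() return a new string.
word = 'cat'
new_word = word.replace('c', 'b')

print(colors)          # ['purple', 'green', 'blue']
print(tuple_changed)   # False
print(word, new_word)  # cat bat
```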

The main Python scalar data types are:
* numeric
    * int
    * float
    * etc. (e.g. complex)
* string
* bool

The main Python data structures are:
* list
* tuple
* set
* dictionary

These are all **objects** and have properties and methods associated with them.  

## Type and Converting Types

If you would like to know the type of a value, you can use `type()`.

In [None]:
type(7)

In [None]:
type(7.1)

In [None]:
type('Hello World')

In [None]:
type(True)

In [None]:
type(5+3j)

You can convert values from one type to another using built-in Python functions. This process is called *type casting.*

In [None]:
s = '32'
type(s)

In [None]:
t = int(s)
type(t)
# this creates a new variable called t that is an int, and does not change original variable s.

In [None]:
b = True
type(b)

In [None]:
i = int(True)
i

In [None]:
one = 1
type(one)

In [None]:
f_one = float(one)
f_one

In [None]:
int(float('1.23456789'))
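
The cell above converts in two steps for a reason. A minimal sketch of why is shown here, assuming a made-up variable name `text_number`; the `try`/`except` block is only used to catch the error so the example runs to the end.

```python
text_number = '1.23456789'  # hypothetical example value

# int() cannot parse a string that contains a decimal point.
try:
    direct = int(text_number)
except ValueError:
    direct = None  # the direct conversion failed

# Converting to float first works; int() then drops the fractional part.
two_step = int(float(text_number))

print(direct)    # None
print(two_step)  # 1
```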

**Question 1.1**: Convert 3.99 to an integer.

In [None]:
...

**Question 1.2:** Consider `3.14159`. What is its type?

In [None]:
...

**Question 1.3:** Convert `3.14159` to a string.

In [None]:
...

## Strings

A string is a sequence of characters, and we can access the characters using bracket notation with an integer inside the brackets.  In Python, indexing starts at `0`.  Strings are immutable: we can't modify a string once we make it.  An empty string can be created with either single or double quotes (`''` or `""`).

In [None]:
fruit = 'pineapple'
letter = fruit[2]
print(letter)

### Slicing
We can slice a string.  For example, if we want to print the `ap` in the middle of the word pineapple, we can give the index where the slice should start and the index one beyond where it should end.  If these numbers are the same, you will get an empty string. If the slice starts at the beginning, you can omit the first index; if it goes all the way to the end, you can omit the second index.

Indexing is a good thing to pay attention to because we will generalize it when working in data frames.

In [None]:
print(fruit[4:6])

In [None]:
print(fruit[:6]) # from first to 6th position (indices 0 to 5)

In [None]:
print(fruit[4:]) # from 5th to last position (indices 4 to 8)

In [None]:
print(fruit[-1]) # notice the negative index.  This will print the last character

**Question 2.1:** Consider the string 
```python 
"A quick brown fox jumps over the lazy dog." 
``` 
Print the word `dog` using slicing and positive indices.

In [None]:
phrase = ...
dog_with_positive_inds = ...
print(dog_with_positive_inds)

In [None]:
grader.check("q21")

**Question 2.2:** Consider the same string 
```python
"A quick brown fox jumps over the lazy dog." 
```
Print the word `dog` using slicing and negative indices.

In [None]:
dog_with_negative_inds = ...
print(dog_with_negative_inds)

In [None]:
grader.check("q22")

**Question 2.3:** Using `phrase` and slicing, assign `new_phrase` to the string 
```python
"fox jumps over the lazy dog." 
```

In [None]:
new_phrase = ...
print(new_phrase)

In [None]:
grader.check("q23")

### Methods 
Python has built-in functions and methods that work on strings.  
Methods are functions which are built into the object and are available to any instance of the object.
For a whole list, do a quick Google search or type `dir(str)` or `dir(fruit)`, which will list the methods available for strings. You can also type `fruit.` and then press the `tab` key for method options. For today, we will start by highlighting
- `len()` which is a function that returns the length of an object
- `.upper()` which is a string method that will convert the string to uppercase
- `.lower()` which is a string method that will convert the string to lowercase

Calling a function is done with the function name, parentheses and whatever argument(s) it takes.
Calling a method is similar to calling a function, but the syntax is different: `variable_name.method_name()` (the period is a delimiter).  Calling a method is called an invocation (we are invoking the method on the variable).

In [None]:
# len is a function
len(fruit) # how long is the word pineapple

We are invoking `.upper()` on fruit.

In [None]:
# upper is a method
fruit.upper()

In [None]:
# note, the original fruit variable is unchanged
fruit

Remember, strings are immutable, so if we want the uppercase word PINEAPPLE, we will need to save it to a variable.

In [None]:
upper_fruit = fruit.upper()
upper_fruit

We are invoking find on fruit to find the index of the letter `e` in the word "pineapple".

In [None]:
fruit.find('e')

We can also find substrings as well as a single character.  If `find()` returns $-1$, it means the substring was not found.

In [None]:
fruit.find('ine')

In [None]:
fruit.find('data271')

We can change substrings in a string by using the `replace()` method which takes the old substring as a first argument and a new substring as a second argument. 

In [None]:
fruit.replace('apple','cone')

In [None]:
# Again, this does not change the original string
fruit

`replace()` replaces all occurrences of the old substring with the new one.

In [None]:
fruit.replace('p','b')

**Question 3.1:** Consider again the string 
```python
"A quick brown fox jumps over the lazy dog." 
```
What is the index of the word `brown`?

In [None]:
index_of_brown = ...
index_of_brown

In [None]:
grader.check("q31")

**Question 3.2:** Is the `.find()` method case-sensitive? Assign `case_sensitive` to either `True` or `False`. 

In [None]:
case_sensitive = ...

In [None]:
grader.check("q32")

**Question 3.3:** Find the index of the word "coyote" in `phrase.`

In [None]:
coyote_index = ...
coyote_index

In [None]:
grader.check("q33")

**Question 3.4:** Replace the word "dog" with the word "coyote" in `phrase.`

In [None]:
coyote_phrase = ...
coyote_phrase

In [None]:
grader.check("q34")

**Question 3.5:** Now find the index of the word "coyote" in `coyote_phrase.`

In [None]:
new_coyote_index = ...
new_coyote_index

In [None]:
grader.check("q35")

**Question 3.6:** Consider the Python code below which assigns a string to the name `my_string`. Use `find()` and string slicing to extract the portion of the string after the colon and then convert the extracted string to a float.

In [None]:
my_string = 'K-ARGT-Russe: 5.43'
index = ...
number = ...
number

In [None]:
grader.check("q36")

## Lists

A list is a sequence of values (potentially of different types).  Lists are created with the square bracket notation.  Values are called elements, and elements are accessed with bracket notation (just as they are for strings).  In general, we can view lists as a mapping between indices and elements.  Lists are **mutable.**

In [None]:
list1 = ['fox', 7, 3.1, [1,2]] # this list has a string, an int, a float, and a list as its elements
list1[1]

In [None]:
list1[1] = 'apple' # reassign the second element of the list
list1

### Traversing a List
It can be convenient to use the functions `range()` and `len()` to loop through a list and update elements.  For example, the loop below replaces each number in the list with its square.

In [None]:
list2 = [1, 2, 3, 4]
for i in range(len(list2)):
    list2[i] = list2[i] ** 2
list2   # note, list2 is the list of squares and the original list has been overwritten.

*Side note:* You can also loop through strings if you want to.

In [None]:
cal_poly_string = "I am a Cal Poly Humboldt student. Woohoo!"

num_vowels = 0
for character in cal_poly_string:
    if character.lower() == 'a':
        num_vowels += 1
    elif character.lower() == 'e':
        num_vowels += 1
    elif character.lower() == 'i':
        num_vowels += 1
    elif character.lower() == 'o':
        num_vowels += 1
    elif character.lower() == 'u':
        num_vowels += 1
num_vowels

*Note:* The cell above could be written more concisely using the following code.

In [None]:
num_vowels = 0 
for character in cal_poly_string:
    if character.lower() in 'aeiou':
        num_vowels += 1
num_vowels

**Question 4.1:** Use the functions `range()` and `len()` to update the list `list3` so that it contains each number from the original `list3` divided by 10. Make sure that each element of the result is an integer, not a float.

In [None]:
list3 = [100, 200, 300, 400]
...
    list3[i] = ...
list3   

In [None]:
grader.check("q41")

### Methods

Lists have a number of methods such as `append()`, `sort()`, `copy()`, `insert()`, `pop()`, `remove()`, and `reverse()`, among others.  For example, the method `pop()` takes the index of the item you want to delete as an argument and returns the element removed from the list.  The method `remove()` takes as an argument the element you want to remove (not the index).  **These methods modify the list** (except `copy()`, which returns a new list).  If you might need the original values, make sure to make a copy first.
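
A small sketch of several of these methods working together is shown here. The list `scores` and the other names are made up for illustration.

```python
scores = [10, 20, 30, 40, 20]
backup = scores.copy()        # independent copy of the original values

removed_item = scores.pop(1)  # removes and returns the element at index 1
scores.remove(20)             # removes the first remaining occurrence of the value 20
scores.insert(0, 5)           # inserts the value 5 at index 0
scores.reverse()              # reverses the list in place

print(removed_item)  # 20
print(scores)        # [40, 30, 10, 5]
print(backup)        # [10, 20, 30, 40, 20]
```

Because `backup` was made with `copy()` before any changes, it still holds the original values.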

In [None]:
list4 = [47, 3, 7, 8]
list4.sort()
list4

Suppose we wanted to sort this list in descending order. When we used the `datascience` library in DATA 111, we used a `datascience.Table` method `.sort`, which accepted one argument for the column of the table to sort by and an optional second argument `descending = True` to sort in descending order. Let's see if this works for the `sort` method on lists.   

In [None]:
list4.sort(descending = True)

It looks like that didn't work. Fortunately, we can call `help()` on any built-in function or method to read its documentation and find out more about its arguments. 

In [None]:
help(list4.sort)

Looks like this method takes an argument called `reverse` which has a default value of `False`. Let's change that and see what happens.

In [None]:
list4.sort(reverse=True)
list4

Nice! Before we move on and explore more methods and functions, it's helpful to know that another way to look up the documentation for a particular function or method in a Jupyter notebook is by using a question mark `?` before or after it. The docstring should show up as a pop-up. This process is referred to as *object introspection* and it can be used on any object, including variables you've defined.

In [None]:
list4.sort?

In [None]:
fruit?

The `append()` method allows us to append new values to a list.

In [None]:
list4.append(54)
list4

Python also has a number of built-in functions that work on lists and save you from writing a loop.  Examples include `len()`, `max()`, `min()`, and `sum()`.

In [None]:
max(list4)

### Lambda 
Using `lambda` (a Python keyword to generate an anonymous function) ultimately means you don't have to define an entire function. Lambda functions are created, used, and immediately destroyed - so they don't clutter your code with more code that will only ever be used once.  Lambda syntax is as follows: 

`lambda input_variable(s): nice one liner`  

In the example below, the function is called right as it is created. Notice that we do not give the function a name because it will never be used again.

In [None]:
# example lambda function to divide two numbers
(lambda x, y: x / y)(10, 2)
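
For comparison, here is a sketch of the same division written as a named function next to its `lambda` form. The names `divide` and `divide_lambda` are made up for illustration, and the lambda is stored in a variable only so the two versions can be compared side by side.

```python
# A named function defined with def ...
def divide(x, y):
    return x / y

# ... and an equivalent anonymous function written with lambda.
divide_lambda = lambda x, y: x / y

print(divide(10, 2))         # 5.0
print(divide_lambda(10, 2))  # 5.0
```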

### Another Way to Sort Lists and Use Lambda
We can also sort lists with the function `sorted()`, which returns a list of sorted elements. If we want to sort in a particular way or if we want to sort a complex list of elements (e.g., nested lists or a list of tuples) we can invoke the `key` argument.

In [None]:
help(sorted)

The idea behind the key argument is that it should take in a set of instructions that will essentially point the `sorted()` function at those list elements which should be used to sort by. When it says `key=`, what it really means is: As I iterate through the list, one element at a time, I'm going to pass the current element to the function specified by the key argument and use that to create a transformed list which will inform me on the order of the final sorted list.
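
To make the "transformed list" idea concrete, here is a rough sketch that computes the keys by hand and then lets `sorted()` do the same thing internally. The list `numbers` is made up for illustration, and the built-in function `abs()` (absolute value) is used as the key.

```python
numbers = [-7, 2, -1, 5]  # hypothetical example list
key_function = abs        # any one-argument function can serve as a key

# The key function is applied once to each element.
computed_keys = []
for element in numbers:
    computed_keys.append(key_function(element))

print(computed_keys)                      # [7, 2, 1, 5]
print(sorted(numbers, key=key_function))  # [-1, 2, 5, -7]
```

The result contains the original elements, not the keys; the keys only decide the order.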

With no key specified, `sorted()` will return elements in ascending order. **This will not modify the list**. 

In [None]:
mylist = [3, 6, 23, 2, 4, 8, 3]  # an example list
print(sorted(mylist))
print(mylist)

If we instead wanted to separate out even and odd numbers in the list, we can use a key and a lambda function.  Our lambda function checks to see if a number is even (no remainder when dividing by 2).  It might seem strange that the odd numbers are returned before the even numbers, but the expression `x % 2 == 0` evaluates to `False` for odd numbers and `True` for even numbers, and Python sorts `False` (which equals `0`) before `True` (which equals `1`). 

In [None]:
sorted(mylist, key = lambda x: x % 2 == 0)

You might also notice that the even numbers are not sorted in ascending order.  This is because `sorted()` compares only the keys, and elements with equal keys keep their original order relative to each other.
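
If we wanted the numbers within each group to be in ascending order as well, one possible approach (a sketch, not the only way) is to return a tuple from the key function, so that ties on the first item are broken by the second. Tuples are covered later in this lab; the list `values` repeats the numbers from `mylist`.

```python
values = [3, 6, 23, 2, 4, 8, 3]  # same numbers as mylist above

# Key with one part: odd numbers first, original order kept within each group.
grouped_only = sorted(values, key=lambda x: x % 2 == 0)

# Key with two parts: first "is it even?", then the number itself to break ties.
grouped_and_sorted = sorted(values, key=lambda x: (x % 2 == 0, x))

print(grouped_only)        # [3, 23, 3, 6, 2, 4, 8]
print(grouped_and_sorted)  # [3, 3, 23, 2, 4, 6, 8]
```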

As another example, if we want to sort a list of lists by the second element, we can use the second element as the key.  The lambda function takes the second element of each list for the sorting.

In [None]:
mylist2 = [[3, 5, 8], [6, 2, 8], [2, 9, 4], [6, 8, 5]]
sorted(mylist2, key=lambda x: x[1])

**Question 5.1:** Sort the list `['aaac', 'ccb', 'd', 'ba']` by length of string.

In [None]:
mylist3 = ['aaac', 'ccb', 'd', 'ba']
sorted_by_length = ...
sorted_by_length

In [None]:
grader.check("q51")

**Question 5.2:** Sort the list `['aaac', 'ccb', 'd', 'ba']` by the last letter of the string.

In [None]:
sorted_by_last = ...
sorted_by_last

In [None]:
grader.check("q52")

### Split and Join
These methods conveniently allow us to convert a string to a list and vice versa.  (They can work on other data structures also, but we focus on strings and lists here.)

In [None]:
# turn string into list of words
data_phrase = 'It is a capital mistake to theorize before one has data'
listofwords = data_phrase.split(' ') # split on the space
listofwords

In [None]:
# turn list of words into a string with a delimiter like a space
list_of_words = ['Live', 'what', 'you', 'love']
delimiter = ' ' # put space between words
quote = delimiter.join(list_of_words)
quote

**Question 6.1:** Create a list called `random_numbers` of 50 random integers between 1 and 100.  Create a second list called `div_3_list` from the first list containing only numbers in the original list which are divisible by $3$.  Repeat the experiment 3 times.  Then calculate the average difference in length between the two lists.  

*Hint*: The `random` library allows us to generate random numbers. For example, to make a list of 5 random integers between 0 and 10 (inclusive), the following code will work:

In [None]:
import random 

example_random_list = [random.randint(0,10) for x in range(5)] # this syntax will become more clear next week
print(example_random_list)

In [None]:
number_of_experiments = ... 
# make an empty list to hold difference for each experiment
difference = []
for ... in ...: 
    # one experiment
    random_numbers = ...
    div_3_list = ... # % is the modulus and returns the remainder after division
    difference.append(len(random_numbers) - len(div_3_list))


# calculate the mean
average_difference = ...
average_difference

In [None]:
grader.check("q61")

## Dictionaries

A dictionary is similar to a list, but more general.  In a list, the index positions must be integers, but in a dictionary there is more freedom for the type of the indices.  We think of a dictionary as a mapping between *keys* and *values*.  Items are looked up by key rather than by position (since Python 3.7, dictionaries do remember the order in which keys were inserted).  For example, we could consider our favorite Hollywood actors (keys) and their respective ages (values).  We could also consider a literal dictionary, where keys are English words and values are the Spanish equivalent.  In this example, both keys and values are strings.  We create an empty dictionary using `dict()` or `{}` (empty curly braces).  We use the square brackets to add items.

- the dictionary method `values()` returns a view of the values that can be converted to a list.
- the dictionary method `fromkeys()` creates a dictionary from a given sequence of keys.  It can take two parameters: the keys and an optional value that is assigned to every key (the default is `None`).
- the dictionary method `keys()` has no parameters and returns a view object that displays the keys.
- the dictionary method `items()` returns a view of the dictionary's key-value pairs as tuples, which can be converted to a list.
- the `in` operator works on dictionaries. It checks to see if something is a key in a dictionary. 
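
A short sketch of creating a dictionary, adding items with square brackets, and using `fromkeys()` is shown here. All of the names and values are made up for illustration.

```python
# Start with an empty dictionary and add items with square brackets.
english_to_spanish = {}
english_to_spanish['one'] = 'uno'
english_to_spanish['two'] = 'dos'

# fromkeys() builds a dictionary from keys; every key gets the same value.
default_none = dict.fromkeys(['x', 'y'])     # the value defaults to None
default_zero = dict.fromkeys(['x', 'y'], 0)  # every value is 0

print(english_to_spanish)           # {'one': 'uno', 'two': 'dos'}
print(default_none)                 # {'x': None, 'y': None}
print(default_zero)                 # {'x': 0, 'y': 0}
print('one' in english_to_spanish)  # True
print('uno' in english_to_spanish)  # False, because 'uno' is a value, not a key
```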

Dictionaries can be quite useful as a set of counters.  For example, if you want to count the number of times each word appears in a text, you could create a dictionary and the first time you see a word, you could add it to the dictionary with the corresponding value of one.  If you see the word again, you will increment the value.  An advantage of this implementation is that we don't have to know ahead of time what words we will see.
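
The counter idea can be sketched on a smaller problem: counting the letters in a single word. The word and the name `letter_counts` are made up for illustration; this is one common way to write the pattern, not the only one.

```python
letter_counts = {}
for letter in 'banana':
    if letter in letter_counts:
        letter_counts[letter] += 1  # seen before: increment the count
    else:
        letter_counts[letter] = 1   # first time: start the count at one

print(letter_counts)  # {'b': 1, 'a': 3, 'n': 2}
```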

In [None]:
fruitdict = {'a': 'apple', 'b': 'banana', 'c': 'cantaloupe'}
fruitdict.keys()

In [None]:
fruitdict.items()

In [None]:
fruitdict['a']

In [None]:
list(fruitdict.values())

In [None]:
'b' in fruitdict

In [None]:
'banana' in fruitdict # this is checking to see if banana is a key, not a value!

In [None]:
'banana' in list(fruitdict.values()) # this checks to see if banana is a value

**Problem 7:** Consider the following dictionary:

In [None]:
student_info = {
    'name': 'Alice',
    'age': 20,
    'grades': {'math': 90, 'english': 85, 'history': 88},
    'courses': ['math', 'english', 'history']
}

**Question 7.1:** Access and print Alice's age.

In [None]:
alice_age = ...
print(alice_age)

In [None]:
grader.check("q71")

**Question 7.2:** Access and print Alice's grade in English.

In [None]:
alice_egrade = ...
print(alice_egrade)

In [None]:
grader.check("q72")

**Question 7.3:** Add physics to the courses Alice took.

In [None]:
...
student_info

In [None]:
grader.check("q73")

**Question 7.4:** Update Alice's math grade to 95.

In [None]:
...
student_info

In [None]:
grader.check("q74")

**Question 7.5:** Print the keys of the `grades` dictionary.

In [None]:
grade_keys = ...
print(grade_keys)

In [None]:
grader.check("q75")

**Question 7.6:** Check if 'biology' is in the list of courses. 

In [None]:
is_bio = ...
print(is_bio)

In [None]:
grader.check("q76")

### Sorted Dictionaries 
We can also use the `sorted()` function with a key to sort a dictionary and use a `lambda` function to sort on the value instead of the key.  The `items()` method retrieves a dictionary's key-value pairs.

In [None]:
actor = {'Keanu': 58, 'Hugh' : 54, 'Jason': 43, 'Mark' : 51}
sorted_by_name = sorted(actor) # sort on name, this returns a list
sorted_by_age = sorted(actor.items(), key = lambda x: x[1]) # sort on age
# this returns a list of tuples
print(sorted_by_name)
print(sorted_by_age)

## Tuples

A tuple is a sequence of values.  The values can be any type and they are indexed by integers.  The important difference between tuples and lists is that tuples are **immutable** whereas lists, as we have seen, are **mutable**.  Tuples are comparable, so we can sort lists of tuples, and tuples that contain only immutable values can be used as keys in Python dictionaries.  Tuples are created with parentheses and then listing values inside, separated with a comma.  An empty tuple is created with `()`.  Elements are accessed with square brackets, just like for lists.
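
Here is a small sketch of a tuple used as a dictionary key, together with one detail of the parentheses syntax. The dictionary `seat_chart` and its contents are made up for illustration, and the `try`/`except` block is only there to catch the error so the example runs to the end.

```python
# Tuples of immutable values can be dictionary keys.
seat_chart = {(1, 'A'): 'Alice', (1, 'B'): 'Bob'}  # hypothetical (row, seat) keys
print(seat_chart[(1, 'B')])  # Bob

# A list is mutable, so it cannot be used as a key.
try:
    seat_chart[[2, 'A']] = 'Carol'
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
print(list_key_allowed)  # False

# A one-element tuple needs a trailing comma; (5) is just the integer 5.
single = (5,)
not_a_tuple = (5)
print(type(single), type(not_a_tuple))  # <class 'tuple'> <class 'int'>
```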

In [None]:
tup1 = ('fox', 7, 3.1, [1,2])

In [None]:
tup1[0]

In [None]:
# we can't reassign values in a tuple because it is immutable
tup1[0] = 'coyote'

The comparison operator with tuples works by comparing the first element of each.  If they are equal, it goes on to the next element and continues until it finds an element that differs.  Once it finds an element that differs, it does not continue.  Consider the comparison below-- Python looks at the first element of each tuple, and since $0<1$ is true, it's done and never considers the next elements.

In [None]:
(0, 7, 2000) < (1, 0, 0)
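
A few more comparisons, sketched with made-up values, show what happens when the first elements tie and that the same rule applies to strings inside tuples.

```python
# First two elements tie, so the third decides: 3 < 4.
print((1, 2, 3) < (1, 2, 4))         # True

# First elements tie, second elements differ: 5 < 2 is False,
# so the remaining element 99 is never considered.
print((1, 5) < (1, 2, 99))           # False

# Strings are compared alphabetically: 'apple' comes before 'banana'.
print(('apple', 3) < ('banana', 1))  # True
```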

### Tuple methods
Since tuples can't be modified, they do not have many methods.

In [None]:
a_tuple = (1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)
a_tuple.count(2)

In [None]:
a_tuple.index(3)

### Sorting Tuples and DSU
The comparability of tuples can be used for sorting tasks.  This is part of a pattern termed DSU, which stands for decorate-sort-undecorate.  The idea is to "decorate" a sequence by creating a list of tuples with a sort key before the element in the sequence (for example, if the sequence is a list of words, we would "decorate" this by creating a tuple (word length, word) for each word).  We then sort the tuples.  We then "undecorate" by extracting the sorted elements of the sequence.

In [None]:
quote = 'The future belongs to those who believe in the beauty of their dreams'
words = quote.split() # split Eleanor Roosevelt's quote into a list of words
words

In [None]:
# DECORATE
t = list()
for word in words:
    t.append((len(word), word)) # for each word, store a tuple with the word length and the word in the list t
t

In [None]:
# SORT
t.sort(reverse = True) # sort the list with the longest word first
t

In [None]:
# UNDECORATE
result = list() # create an empty list and populate it with words in order of length
for length, word in t:
    result.append(word)
result

You may notice that the `for` loop in the last cell looks a little strange. It seems we assigned 2 variables `length` and `word` for each iteration of the loop. This is called *multi-assignment syntax* (also known as tuple unpacking) and it can be used when you are iterating through an object that holds containers or sequences.

Along these lines, a unique syntactic feature of Python is that we can have multiple variables on the **left** side of an assignment.  This allows us to assign more than one variable at a time.  This also gives us a shortcut to swap values of variables.

In [None]:
tup2 = ('apple', 'banana')
x, y = tup2
print(x)

In [None]:
y, x = x, y
print(x)

### Sorting Dictionaries Revisited

We mentioned above that dictionaries have a method called `items()` which returns a collection of tuples.  Each tuple is of the form (key, value).  Dictionaries keep items in insertion order rather than sorted order, so we do not expect this list to be sorted.  However, we can sort a list of those tuples, and this gives us a way to sort the contents of a dictionary by key.  Compare this to the sorting above.  This works for sorting by the key, but not the value.

In [None]:
actor = {'Keanu': 58, 'Hugh' : 54, 'Jason': 43, 'Mark' : 51}
t = list(actor.items()) # returns a list of tuples of key-value pairs
t

In [None]:
t.sort()
t
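
To sort by value with the same tuple-based idea, one option (a sketch that reuses the DSU pattern from above) is to build tuples with the value first. The names `ages`, `value_first`, and `names_by_age` are made up for illustration.

```python
ages = {'Keanu': 58, 'Hugh': 54, 'Jason': 43, 'Mark': 51}  # same data as actor above

# DECORATE: put the value first so it drives the comparison.
value_first = []
for name, age in ages.items():
    value_first.append((age, name))

# SORT: tuples compare by age first, then by name if two ages tie.
value_first.sort()

# UNDECORATE: pull the names back out in order of age.
names_by_age = []
for age, name in value_first:
    names_by_age.append(name)

print(names_by_age)  # ['Jason', 'Mark', 'Hugh', 'Keanu']
```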

## Sets

A set is essentially a collection of distinct things: each element appears only once.  This can be useful in coding if we want to eliminate duplicate entries in a list.

In [None]:
fruit_list = ['apple', 'banana', 'apple', 'pear']
set(fruit_list)

In [None]:
fruit_list_unique = list(set(fruit_list)) # this turns a list to a set to eliminate duplicates and then turns that set back to a list.  
fruit_list_unique

You can find common elements in sets using the `.intersection` method.

In [None]:
a = {1,2,3,4,5}
b = {3,4,5,6,7,8}

a.intersection(b)

And you can find all unique elements occurring in either of two sets with the `.union` method.

In [None]:
a.union(b)

**Question 8.1:** Consider the two lists `listA` and `listB` given below. Assign `common_elements` to a list of elements that occur in both `listA` and `listB`. 

In [None]:
listA = ['I', 'love', 'Data', 271, 'It', 'is', True]
listB = ['i', '<3', 'Data', 111, 'It is', True]

setA = ...
setB = ...

common_elements = ...
common_elements

In [None]:
grader.check("q81")

## Reading files

When you open a file, you are asking the operating system to find the file by its name and make sure it exists. If the open is successful (if the file exists and you have the proper permissions to read the file), the operating system returns a file handle.  This is not the actual data in the file, but a handle that can be used to read the data. Python has a built-in function, `open()`, for this.
If the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the `read()` method on the file handle.
In your culminating task below, you will be working with a `.txt` file and reading its contents into a string.  This code opens the file (which needs to be in the same directory as your Jupyter notebook or `.py` file), reads it and saves its contents to the string called multiline, and then closes the file.

We will learn more about reading files later in this course.

In [None]:
fhand = open('PrideandPredjudice.txt', 'r') # file handle, open in read mode
multiline = fhand.read()
fhand.close()
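
An alternative to calling `close()` yourself is a `with` block, which closes the file automatically when the block ends. The sketch below is self-contained: it first writes a small throwaway file (the file name `sample.txt` and its contents are made up) and then reads it back.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on the lab files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'sample.txt')
with open(path, 'w') as out_file:
    out_file.write('first line\nsecond line\n')

# The file is closed automatically at the end of the with block.
with open(path, 'r') as in_file:
    contents = in_file.read()

print(in_file.closed)  # True
print(repr(contents))  # 'first line\nsecond line\n'
```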

### Task (Main Course): 
In this task, we will analyze a multiline string and generate a unique word count.  We will use the first chapter of Jane Austen's book Pride and Prejudice.  We have provided a preprocessed .txt file of Chapter 1.  This task relies on using
- file reading
- characters, strings, lists, and dictionaries as well as their methods
- sets to create a list of unique items
- sorting on a key
- lambda functions


The general steps will be:
- get the multiline text and save it to a Python variable called multiline 
- eliminate all new lines and all special characters using string methods (e.g., replace())
- find all unique words and their occurrences in the string

In [None]:
multiline

**Question 9.1:** `multiline` is a string that contains multiple lines from the first chapter of Pride and Prejudice. Anywhere you see `\n` in the string is where there is a new line in the chapter. Begin by removing all the `\n` instances in the string.

In [None]:
singleline = ...
singleline

In [None]:
grader.check("q91")

**Question 9.2:** The string contains several special characters such as periods, quotation marks, commas, etc. Remove all special characters and replace them with a space. 

*Hint 1:* The string method `.isalnum` may help. 

In [None]:
help('hello'.isalnum)

*Hint 2:* Your solution will contain multiple spaces in a row sometimes. Don't worry about removing duplicate spaces - keep your answer as is. 

In [None]:
# remove special characters and punctuation
cleaned_singleline = ... # define empty string
for ... in ...:
    # keep letters, numbers, and spaces and replace all other characters with a space
    if ...:
        ...
    else:
        ...

cleaned_singleline

In [None]:
grader.check("q92")

**Question 9.3:** Generate a list of words by splitting the string from the previous question. 

In [None]:
list_of_words = ...
list_of_words

In [None]:
grader.check("q93")

**Question 9.4:** How many unique words are in Chapter 1 of Pride and Prejudice?

In [None]:
unique_words = ...
num_unique_words = ...
num_unique_words

In [None]:
grader.check("q94")

**Question 9.5:** Create a dictionary with unique words as keys. (*Hint:* Use `dict.fromkeys` on `list_of_words`. The values should be `None` in the dictionary you create).

In [None]:
unique_words_as_dict = ...
unique_words_as_dict

In [None]:
grader.check("q95")

**Question 9.6:** Populate the values in the dictionary with the number of times each word occurred by looping through the words in `list_of_words`.

In [None]:
populated_unique_words = unique_words_as_dict
for ... in ...:
    if ...:
        ...
    else:
        ...

populated_unique_words

In [None]:
grader.check("q96")

**Question 9.7:** Sort the dictionary based on frequency of count in descending order. Your answer should be a list of tuples.

In [None]:
top_words = ...
top_words

In [None]:
grader.check("q97")

**Question 9.8:** What are the 25 most common words in Chapter 1 of Pride and Prejudice? Your answer should be a list of tuples containing the words and their corresponding frequencies. 

In [None]:
top_25 = ...
top_25

In [None]:
grader.check("q98")

**Question 9.9:** What word has the highest frequency in Chapter 1 of Pride and Prejudice and what is its frequency?

In [None]:
most_common_word = ...
frequency = ...

most_common_word, frequency

In [None]:
grader.check("q99")

**Question 9.10:** What if words are not case sensitive?  Repeat the exercise with that assumption. Your final answer should be a list of tuples containing the top 25 most common words and their frequencies. Remember to avoid reassigning variables that you used before. (*Hint:* you will need several lines of code). 

In [None]:





new_top_words = ...
new_top_words[:25]

In [None]:
grader.check("q910")

### References
If you want to read more about any of these topics, check out the following resources. 
- Project Gutenberg.  https://www.gutenberg.org/files/1342/old/pandp12p.pdf
- Python for Everybody: Exploring Data in Python 3 by Charles Severance
- Data Wrangling with Python by Tirthajyoti Sarkar and Shubhadeep Roychowdhury (the word counting exercise came from this text).


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)