# Text Analysis 01: Introduction to Python 


---
<img src="https://images.pexels.com/photos/270623/pexels-photo-270623.png?w=940&h=650&dpr=2&auto=compress&cs=tinysrgb" style="width: 500px; height: 275px;" />

### Professor Matthew Specter

This notebook will introduce students to Jupyter notebooks and the Python programming language. We will cover basic data types and syntax, building the foundations for basic text analysis. 

*Estimated Time: 45 minutes*

---

### Topics Covered
- Jupyter notebooks
- Programming with Python: expressions, variables, functions
- String data type and methods

### Table of Contents

[Introduction](#section intro)<br>

1 - [Our Computing Environment: Jupyter Notebooks](#section 1)<br>

2 - [Introduction to Programming with Python](#section 2)<br>

3 - [Strings](#section 3)<br>


---

Computers have vastly changed how we can interact with and analyze text. Thanks to more memory, faster processing, and new research into machine learning, it is possible to work with tens of thousands of documents at a time in ways that grow more sophisticated by the day.

Today, we will dive into the field of **Text Analysis** with the programming language Python. By the end of this workshop, you'll have the tools you need to start working with real documents and data, and the resources to support you in starting your own text analysis project.

---


## 1. Our Computing Environment: Jupyter Notebooks  <a id='section 1'></a>


This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results. 

### Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete any of the cells given to you.)

### Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or hold down the `shift` key and press `return` or `enter`.

Try running this cell:

In [None]:
print("Hello, World!")

And this one:

In [None]:
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")

The fundamental building block of Python code is an **expression**. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [None]:
print("First this line is printed,")
print("and then this one.")

### Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a text cell.  You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

**Understanding Check** In the code cell below, write code that prints out:
   
    A whole new cell! ♪🌏♪

(That musical note symbol is like the Earth symbol.  Its long-form name is `\N{EIGHTH NOTE}`.)

Run your cell to verify that it works.

In [None]:
#Fill in the print statement
print(...)

### Errors
Python is a language, and like natural human languages, it has rules.  It differs from natural language in two important ways:
1. The rules are *simple*.  You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
2. The rules are *rigid*.  If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes.  A computer running Python code is not smart enough to do that.

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong.

Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell.  Run it and see what happens.

In [None]:
print("This line is missing something."

The last line of the error output attempts to tell you what went wrong.  The *syntax* of a language is its structure, and this `SyntaxError` tells you that you have created an illegal structure.  Depending on your Python version, the message may mention "`EOF`" (which means "end of file") or say that `(` was never closed. Either way, the message is saying Python expected you to write something more (in this case, a right parenthesis) before finishing the cell.

There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it.  (Of course, if you're frustrated, feel free to ask a friend or look up the message on a programming support site like Stack Overflow).

Try to fix the code above so that you can run the cell and see the intended message instead of an error.

---

## 2. Introduction to Programming with Python<a id='section 2'></a>

Before getting into the more advanced analysis techniques that will be required in this course, we need to cover a few of the foundational elements of programming in Python.

#### A. Expressions
The departure point for all programming is the concept of the __expression__. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. See below for some examples of basic expressions.

In [None]:
# Examples of expressions:

2 + 2

'me' + ' and I'

12 ** 2

6 + 4

You will notice that only the last line in a cell gets printed out. If you want to see the values of previous expressions, you need to call `print` on that expression. Try adding `print` statements to some of the above expressions to get them to display.
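
As a small sketch of this idea, the cell below wraps a few expressions in `print` so that every value is displayed, not only the last one. The particular expressions are chosen just for illustration.

```python
# Each print call displays the value of the expression inside it.
print(3 * 7)                 # 21
print('text' + ' analysis')  # text analysis
print(2 ** 10)               # 1024
```

Without the `print` calls, running the same three lines in one cell would display only `1024`.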

#### B. Variables
In the example below, `a` and `b` are __variables__: names that refer to Python objects. Here we give each of two objects (an integer and a float, whose Python types are `int` and `float`) a name so that we can use it later. To use a value, simply type the name it was stored under. Variables are stored within the notebook's environment, meaning stored variable values carry over from cell to cell.

In [None]:
a = 4
b = 10/5

Notice that when you create a variable, unlike what you previously saw with the expressions, it does not print anything out.

In [None]:
# Notice that 'a' retains its value.
print(a)
a + b
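
Here is a minimal sketch of how variables behave when they are reused and reassigned. The names `word_count`, `page_count`, and `words_per_page` are hypothetical, chosen only for this illustration.

```python
# Hypothetical variable names, used only for this example.
word_count = 250
page_count = 4

# Division with / always produces a float, even for whole-number inputs.
words_per_page = word_count / page_count
print(words_per_page)            # 62.5

# Reassigning a name replaces its old value...
word_count = 300
print(word_count / page_count)   # 75.0

# ...but values computed earlier are not updated automatically.
print(words_per_page)            # 62.5
```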

#### C. Lists
The next topic is particularly useful in the kind of data manipulation that you will see throughout this workshop. The following few cells will introduce the concept of __lists__. Read through the following cell to understand the basic structure of a list.

A list is an ordered collection of objects. Lists let us store groups of values and other objects together for easy access and analysis. Check out this [documentation](https://www.tutorialspoint.com/python/python_lists.htm) for an in-depth look at the capabilities of lists.

To initialize a list, you use brackets. Putting objects separated by commas in between the brackets will add them to the list. 

In [None]:
# an empty list
lst = []
print(lst)

# reassigning our empty list to a new list
lst = [1, 3, 6, 'lists', 'are', 'fun', 4]
print(lst)

To access a value in the list, put the index of the item you wish to access in brackets following the variable that stores the list. Lists in Python are zero-indexed, so the indices for `lst` are 0, 1, 2, 3, 4, 5, and 6.

In [None]:
# Elements are selected like this:
example = lst[2]

# The above line selects the 3rd element of lst (list indices are 0-offset) and sets it to a variable named example.
print(example)
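
The sketch below shows how indexing relates to the length of a list. It uses a hypothetical list named `words` and the built-in `len` function, which returns the number of elements.

```python
# A hypothetical list of words, used only for this illustration.
words = ['text', 'analysis', 'with', 'Python']

print(words[0])      # first element: text
print(words[3])      # fourth element: Python
print(len(words))    # number of elements: 4

# Because counting starts at 0, the last valid index is len(words) - 1.
last = words[len(words) - 1]
print(last)          # Python
```

Asking for `words[4]` here would raise an `IndexError`, since the list has only four elements and the indices stop at 3.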

#### D. Functions!
Functions are useful when you want to repeat a series of steps on multiple different objects, but don't want to type out the steps over and over again. Many functions are built into Python already; for example, you've already used `print()` to display values, and `len()` retrieves the number of elements in a list. You can also write your own functions, and at this point you already have the skills to do so.


Functions generally take a set of __parameters__ (also called inputs), which define the objects they will use when they are run. For example, the `len()` function takes a list or array as its parameter, and returns the length of that list.


The following cell gives an example of an extremely simple function, called `add_two`, which takes as its parameter an integer and returns that integer with, you guessed it, 2 added to it.

In [None]:
# An adder function that adds 2 to the given n.
def add_two(n):
    return n + 2

In [None]:
add_two(5)
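
As a brief sketch of how a returned value can be used, the cell below redefines `add_two` so that it runs on its own, stores the result in a variable (the name `result` is just an example), and passes one call's output into another call.

```python
# Same function as above, repeated so this cell runs on its own.
def add_two(n):
    return n + 2

# The returned value can be stored in a variable...
result = add_two(5)
print(result)                # 7

# ...or passed straight into another function call.
print(add_two(add_two(1)))   # 5

# The function also works for floats, since + is defined for them too.
print(add_two(2.5))          # 4.5
```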

Easy enough, right? Let's look at a function that takes two parameters, compares them somehow, and then returns a boolean value (`True` or `False`) depending on the comparison. The `is_multiple` function below takes as parameters an integer `m` and an integer `n`, checks if `m` is a multiple of `n`, and returns `True` if it is. Otherwise, it returns `False`.

`if` statements are dependent on boolean expressions. If the conditional is `True`, then the following indented code block will be executed. If the conditional evaluates to `False`, then the code block will be skipped over. Read more about `if` statements [here](https://www.tutorialspoint.com/python/python_if_else.htm).

In [None]:
def is_multiple(m, n):
    if (m % n == 0):
        return True
    else:
        return False

In [None]:
is_multiple(12, 4)

In [None]:
is_multiple(12, 7)
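
To see the boolean expression and the `if`/`else` branches separately, here is a minimal sketch using a hypothetical function named `parity`. It follows the same pattern as `is_multiple` but returns a string instead of a boolean.

```python
# A comparison such as == is itself an expression with a boolean value.
print(12 % 4 == 0)   # True
print(12 % 7 == 0)   # False

# Hypothetical example function: reports whether n is even or odd.
def parity(n):
    if n % 2 == 0:
        # This block runs only when the condition is True.
        return "even"
    else:
        # This block runs only when the condition is False.
        return "odd"

print(parity(12))    # even
print(parity(7))     # odd
```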

---
## 3. Strings <a id='section 3'></a>

A string is Python's data type for text: a sequence of characters. How can you tell whether something is a `str`? Ask Python with the `type` function.

In [None]:
type('Hello!')

In [None]:
type(10)

In [None]:
type('10')
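
The difference between `10` and `'10'` matters in practice. As a small sketch, the cell below applies the same `+` operator to both and gets different kinds of results.

```python
print(type(10))      # <class 'int'>
print(type('10'))    # <class 'str'>

# With integers, + performs addition.
print(10 + 10)       # 20

# With strings, + joins the text end to end.
print('10' + '10')   # 1010
```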

## We can do things with strings

We've already seen how Python handles mathematical operations. We can do some of them on `str` too!

In [None]:
first_name = "Franz"
last_name = "Kafka"
full_name = first_name + last_name
print(full_name)

Remember that computers don't understand context: Python joins the two strings exactly as given and does not add a space for you.

In [None]:
full_name = first_name + " " + last_name
print(full_name)
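
Addition is not the only arithmetic operator that works on strings. As a short sketch, the cell below builds a different ordering of the name with `+` and repeats a string with `*`. The variable name `divider` is only an example.

```python
first_name = "Franz"
last_name = "Kafka"

# Any separator has to be written out explicitly.
print(last_name + ", " + first_name)   # Kafka, Franz

# Multiplying a string by an integer repeats it.
divider = "-" * 10
print(divider)                         # ----------
```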

## Strings are made up of sub-strings

You can think of strings as a [sequence](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#sequence) of smaller strings or characters. We can access a piece of that sequence using square brackets `[]` just like we would an item in a `list`:

In [None]:
full_name[1]

<div class="alert alert-danger">
Don't forget, Python (like many other languages) starts counting from 0.
</div>

In [None]:
full_name[0]

In [None]:
full_name[4]

## You can slice strings using  `[ : ]`

If you want a range (or "slice") of a sequence, you get everything *before* the second index; i.e., Python slicing is *exclusive* of the end index:

In [None]:
full_name[0:4]

In [None]:
full_name[0:5]

You can see some of the logic for this when we consider implicit indices.

In [None]:
full_name[:5]

In [None]:
full_name[5:]
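
One way to see the logic is that slicing at the same index from both sides splits the string into two pieces with nothing lost or repeated. A minimal sketch, with `first` and `rest` as example variable names:

```python
full_name = "Franz Kafka"

first = full_name[:5]   # everything before index 5
rest = full_name[5:]    # everything from index 5 onward

print(first)            # Franz
print(rest)             # prints a space followed by Kafka

# Because the end index is excluded, the two pieces fit back together exactly.
print(first + rest == full_name)   # True

# Skipping the space at index 5 gives just the last name.
print(full_name[6:])    # Kafka
```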

If we want to find out how long a string is, we can use the `len` function:

In [None]:
len(full_name)

## Strings have methods

* There are other operations defined on string data. These are called **string [methods](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method)**. 
* Jupyter notebooks let you use tab-completion after a dot ('.') to see what methods an [object](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#object) (e.g., a defined variable) has to offer. Try it now!

In [None]:
str.

Let's look at the `upper` method. What does it do? Let's take a look at the documentation. Jupyter Notebooks let us do this with a question mark ('?') before *or* after an object (again, a defined variable).

In [None]:
str.upper?

So we can use it to upper-caseify a string. 

In [None]:
full_name.upper()

You have to use the parentheses at the end because `upper` is a method of the string class, and the parentheses are what actually call it.
<p></p>
<div class="alert alert-danger">
Don't forget: simply calling the method does not change the original variable. You must *reassign* the variable:
</div>

In [None]:
print(full_name)

In [None]:
full_name = full_name.upper()
print(full_name)
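
If you want to keep the original text, one option is to store the result under a new name instead of reassigning. The sketch below uses hypothetical variable names `name` and `shouted`, and also shows `lower`, the counterpart of `upper`.

```python
name = "Franz Kafka"

# upper() returns a new string; the original is left untouched.
shouted = name.upper()
print(name)             # Franz Kafka
print(shouted)          # FRANZ KAFKA

# lower() works the same way in the other direction.
print(shouted.lower())  # franz kafka
```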

For what it's worth, you don't need a variable to use the `upper()` method; you can call it on the string itself.

In [None]:
"Franz Kafka".upper()

What do you think should happen when you take upper of an int?  What about a string representation of an int?

In [None]:
1.upper()

In [None]:
"1".upper()

## Challenge 1: Write your name

1. Make two string variables, one with your first name and one with your last name.
2. Concatenate both strings to form your full name and [assign](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#assign) it to a variable.
3. Assign a new variable that has your full name in all upper case.
4. Slice that string to get your first name again.

## Challenge 2: Try seeing what the following string methods do:

* `split`
* `join`
* `replace`
* `strip`
* `find`

In [None]:
my_string = "It was a Sunday morning at the height of spring."

## Challenge 3: Working with strings

Below is a string of Edgar Allan Poe's "A Dream Within a Dream":

In [None]:
poem = '''Take this kiss upon the brow!
And, in parting from you now,
Thus much let me avow —
You are not wrong, who deem
That my days have been a dream;
Yet if hope has flown away
In a night, or in a day,
In a vision, or in none,
Is it therefore the less gone?  
All that we see or seem
Is but a dream within a dream.

I stand amid the roar
Of a surf-tormented shore,
And I hold within my hand
Grains of the golden sand —
How few! yet how they creep
Through my fingers to the deep,
While I weep — while I weep!
O God! Can I not grasp 
Them with a tighter clasp?
O God! can I not save
One from the pitiless wave?
Is all that we see or seem
But a dream within a dream?'''

What is the difference between `poem.strip("?")` and `poem.replace("?", "")` ?

At what index does the word "*and*" first appear? Where does it last appear?

How can you answer the above accounting for upper- and lowercase?

## Challenge 4: Counting Text

Below is a string of Robert Frost's "The Road Not Taken":

In [None]:
poem = '''Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.'''

Using the `len` function and the string methods, answer the following questions:

How many characters are in the poem? (Note that `len` counts every character, including spaces, punctuation, and line breaks, not only letters.)

How many words?

How many lines? (HINT: A line break is represented as  `\n`  )

How many stanzas?

How many unique words? (HINT: look up what a `set` is)

Remove commas and check the number of unique words again. Why is it different?

---

## Bibliography


-  Notebook heavily adapted from materials by Chris Hench (https://github.com/henchc/textxd-2017, notebooks 01 and 02)

---
Notebook developed by: Keeley Takimoto

Data Science Modules: http://data.berkeley.edu/education/modules
