# Starting out with Python

Our goal is to learn enough Python to extract data from texts, manipulate it, and generate results. To do that, we have to start (more or less) at the beginning.

## Input and Output
Fundamentally we will be passing commands to the computer (input), and viewing or manipulating the results of its work (output). In Jupyter, we pass input in `cells` and see the output printed below.

For example, a new notebook starts with a blank `cell`. If you hit `enter` or click into the cell, you will be able to begin typing:

In [3]:
100

100

In the example above, our `input` was `100`. I then hit `shift + enter` to execute the cell (you can also click the "play" button above). The output (`100`) prints below the cell. Of course, the point is usually to transform our `inputs` into new `outputs`:

In [4]:
100 + 1

101

## Jupyter Shortcuts
You can use keyboard shortcuts to control a lot of different parts of the Jupyter notebook. Below are the few that will be valuable to memorize for everyday use:

| command | description |
|----------|-----------------------|
| `enter` | enter edit mode for the selected cell |
| `esc` | exit edit mode for the selected cell |
| `shift + enter` | execute the code inside of the cell |
| `b` | insert a new cell below your current cell |



## Built-in objects

Python has a large number of built-in objects that are used for a variety of tasks. We're going to start with `strings` and `numbers`.

### Strings
A "string" is a computational representation of letters and other symbols. We can put texts of any length in a string: single letters, sentences, whole books, etc. In Python, we represent strings by marking them with `'` or `"` marks. For instance:
```
'this is a string'
```
And:
```
"this is also a string"
```
How do we know? We can have Python check the `type` of the object:

In [1]:
type('this is a string')

str

In [2]:
type("this is also a string")

str

Note that strings can contain other characters besides letters:

In [5]:
"This is also a string despite all this: 0238410293874&^^$#@%^%(*(_@*))"

'This is also a string despite all this: 0238410293874&^^$#@%^%(*(_@*))'

Sometimes we might want to type or paste in long strings. That can be accomplished with `'''` or `"""` at the beginning and end:

In [6]:
'''
Because I could not stop for Death,
He kindly stopped for me;
The carriage held but just ourselves
And Immortality.
'''

'\nBecause I could not stop for Death,\nHe kindly stopped for me;\nThe carriage held but just ourselves\nAnd Immortality.\n'

In [8]:
"""
We slowly drove, he knew no haste,
And I had put away
My labor, and my leisure too,
For his civility.

-Emily Dickinson
"""

'\nWe slowly drove, he knew no haste,\nAnd I had put away\nMy labor, and my leisure too,\nFor his civility.\n\n-Emily Dickinson\n'

Notice what Python outputs:
```
'\nWe slowly drove, he knew no haste,\nAnd I had put away\n
```
Those `\n` characters represent whitespace, specifically a newline. Notice they occur at each of the line-breaks in the poem. We will come back to this later.

### Manipulating strings
Sometimes we will want to combine strings with each other. We can do that in a number of ways:

In [9]:
'hello' + 'there'

'hellothere'

Notice what happens: `hello` is "added" to `there`. Unlike us, the computer does not assume that these are two separate words: they are strings to be combined using `+`.

We can fix this in a number of ways:

In [10]:
'hello' + ' ' + 'there'

'hello there'

In [11]:
'hello ' + 'there'

'hello there'

In [12]:
'hello' + ' there'

'hello there'
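Beyond `+`, Python offers other ways to combine strings. The sketch below shows two of them next to the approach used above. The variable names are made up for illustration, and `join` takes a list, which we have not covered yet.

```python
# Three ways to build the same string; all names here are illustrative.
first_word = 'hello'
second_word = 'there'

with_plus = first_word + ' ' + second_word       # concatenation, as above
with_join = ' '.join([first_word, second_word])  # join puts ' ' between the items
with_fstring = f'{first_word} {second_word}'     # f-string (Python 3.6 and later)

print(with_plus)
print(with_join)
print(with_fstring)
```

All three lines print `hello there`. Which one to use is mostly a matter of taste; `+` is fine for short cases like this one.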

We could even try playing around with that newline character to separate the words. This doesn't print the whitespace:

In [16]:
'hello' + '\n' + 'there'

'hello\nthere'

But `print` does:

In [15]:
print('hello' + '\n' + 'there')

hello
there
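To make the difference between the two displays concrete, here is a small sketch. It assumes nothing beyond the built-in functions `len`, `repr`, and `print`; the variable name `greeting` is just for illustration.

```python
greeting = 'hello' + '\n' + 'there'

print(len('\n'))       # 1: the backslash and the n together stand for one character
print(len(greeting))   # 11: five letters, one newline, five letters
print(repr(greeting))  # the escaped form the notebook echoes: 'hello\nthere'
print(greeting)        # the rendered form, on two lines
```

The notebook echoes the `repr` form when a cell ends in a bare expression, which is why the `\n` was visible in the first example but not in the `print` example.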


## Comments
It's good practice to write comments in your code describing what it does. It will help your peers (and your future self).

We indicate a comment with a `#`. The computer does not do anything with comments; they're only for humans.

In [26]:
# this is a comment

print('this will not print the comments')

# see?

this will not print the comments


## Numbers
Numbers can be typed directly:

In [19]:
1

1

In [33]:
1 + 1

2

Most math will work as you would expect:

In [21]:
1 - 1

0

In [27]:
1 * 5 # * means multiplication

5

In [28]:
1 / 5 # / means division

0.2

As we learned above, numbers in quotes *are not treated as numbers*. They are strings:

In [32]:
'1' + '1'

'11'
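If we have a number stored as a string and want to do math with it, the built-in functions `int` and `float` can convert it. This is a brief sketch; the variable name `text_one` is made up for the example.

```python
text_one = '1'

print(text_one + text_one)            # prints 11: the two strings are joined
print(int(text_one) + int(text_one))  # prints 2: the two integers are added
print(float('0.5') + 1)               # prints 1.5: float() reads decimal text
print(str(1) + str(1))                # prints 11: str() converts the other way
```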

We often deal with fractional numbers. Python has multiple ways of representing numbers. The two that we will primarily be concerned with are `integers` and `floating point numbers` or `floats`.

In [34]:
type(1)

int

In [35]:
type(1.0)

float

Operations on integers can produce floats:

In [36]:
type(1/5)

float

Operations on floats usually do not produce integers, even when the result is a whole number:

In [38]:
0.5+0.5

1.0

In [37]:
type(0.5+0.5)

float
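A few more cases may help show when Python gives back an `int` and when it gives back a `float`. This sketch uses the floor-division operator `//` and the `int` function, neither of which has been introduced above.

```python
print(type(6 / 3))  # <class 'float'>: / gives a float even when the answer is whole
print(6 // 3)       # 2: floor division on integers stays an integer
print(7 // 2)       # 3: the fractional part is dropped
print(int(1.0))     # 1: int() converts a float to an integer
print(1 + 1.0)      # 2.0: mixing an int and a float gives a float
```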

# Variables
Now we can use Python as a calculator, and to print strings. Variables make it possible for us to store values, and change them over time.

## What is a variable?
Swiss semiotician Ferdinand de Saussure divided the idea of a sign into two components: the signifier and the signified. The word "tree" is a signifier. An actual tree (bark, leaves, etc.) is its signified.

One of Saussure's great contributions is the observation that the relationship between signifier and signified is *arbitrary*. "Tree" is not clearly a better signifier than "nbawlks" for the thing we call a tree, except by convention.

The arbitrary relation between the signifier and the signified allows us to choose variable names to represent many different objects:

In [42]:
tree = """
          &&& &&  & &&
      && &\/&\|& ()|/ @, &&
      &\/(/&/&||/& /_/)_&/_&
   &() &\/&|()|/&\/ '%" & ()
  &_\_&&_\ |& |&&/&__%_/_& &&
&&   && & &| &| /& & % ()& /&&
 ()&_---()&\&\|&&-&&--%---()~
     &&     \|||
             |||
             |||
             |||
       , -=-~  .-^- _

"""

In [44]:
print(tree)


          &&& &&  & &&
      && &\/&\|& ()|/ @, &&
      &\/(/&/&||/& /_/)_&/_&
   &() &\/&|()|/&\/ '%" & ()
  &_\_&&_\ |& |&&/&__%_/_& &&
&&   && & &| &| /& & % ()& /&&
 ()&_---()&\&\|&&-&&--%---()~
     &&     \|||
             |||
             |||
             |||
       , -=-~  .-^- _




## Using variables

In [45]:
first_name = 'Erik'

In [46]:
last_name = 'Fredner'

In [47]:
first_name + last_name

'ErikFredner'

In [49]:
favorite_number = 33

In [50]:
favorite_number / 3

11.0

To keep going with Saussure, the arbitrary relation between the signifier and the signified enables *reassignment*. Consider the following:

In [53]:
first_name

'Erik'

What will happen if I set `first_name` equal to something else?

In [54]:
first_name = 'Eric'

In [55]:
first_name

'Eric'

Variables can be overwritten. This allows you to use one variable name for *every element* in an operation. Below is a more complex example. Pay attention to how the variable name is being used:

In [60]:
favorite_things = ['my friends', 'breakfast', 'my cats']

In [62]:
for thing in favorite_things:
    print(thing)
    print('One of my favorite things is ' + thing + '!')

my friends
One of my favorite things is my friends!
breakfast
One of my favorite things is breakfast!
my cats
One of my favorite things is my cats!


This example contains several things we have not discussed yet, including `lists` and `for` loops.

I want you to pay attention to how the variable `thing` works. For each of the items in the list `favorite_things`, the above script prints the variable `thing`, followed by a string containing `thing`.

`thing` changes over time. The first `thing` is `'my friends'`. The second is `'breakfast'`. And so on.
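One way to see the reassignment is to write out by hand roughly what the loop does. This is only a sketch of the idea, not how Python runs loops internally. The `[0]` notation picks an item out of the list by its position, counting from 0.

```python
favorite_things = ['my friends', 'breakfast', 'my cats']

thing = favorite_things[0]  # first pass: thing is 'my friends'
print('One of my favorite things is ' + thing + '!')

thing = favorite_things[1]  # second pass: thing is overwritten with 'breakfast'
print('One of my favorite things is ' + thing + '!')

thing = favorite_things[2]  # third pass: thing is overwritten with 'my cats'
print('One of my favorite things is ' + thing + '!')

print(thing)  # thing keeps the last value it was given
```

The `for` loop saves us from repeating those lines, which matters once a list has thousands of items.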

Not everything can be reassigned. For instance:

In [63]:
a = 1

In [64]:
a

1

In [65]:
a = 2

In [66]:
a

2

In [67]:
1 = 2

SyntaxError: can't assign to literal (<ipython-input-67-c0ab9e3898ea>, line 1)

# Doing things with strings
So far we have only learned to combine strings with `+`. We can do lots of other things with them. One of the most important is `slicing`, which allows us to cut strings into pieces. For instance, let's say we wanted to get my initials:

In [68]:
first_name = 'Erik'

In [69]:
last_name = 'Fredner'

In [70]:
first_initial = first_name[0]
last_initial = last_name[0]
initials = first_initial+last_initial
print(initials)

EF


The `[]` characters are used to indicate the `index` of the string that we want to access. Strings are indexed from 0, with each subsequent character indicated by the subsequent integer. So:

In [71]:
first_name[1]

'r'

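It helps to visualize it:

| character | `E` | `r` | `i` | `k` |
|-----------|-----|-----|-----|-----|
| index | 0 | 1 | 2 | 3 |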

We could rearrange these to spell a new word:

In [73]:
first_name[2] + first_name[1] + first_name[3]

'irk'

We can also `slice` a string using indices:

In [74]:
last_name[0:4]

'Fred'

*Common point of confusion*

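People who are new to Python are often perplexed by the way that slicing works. Let's make the same chart for `last_name`:

| character | `F` | `r` | `e` | `d` | `n` | `e` | `r` |
|-----------|-----|-----|-----|-----|-----|-----|-----|
| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 |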

So why does `last_name[0:4]` return `Fred`? Shouldn't it be `Fredn`? The best way to think about it is that the general form `X[I:J]` says: "give me everything in `X` from index `I` up to *but not including* index `J`." To get the last three letters, for instance:

In [76]:
last_name[4:7]

'ner'
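A handy consequence of the "up to but not including" rule is that a slice `X[I:J]` contains `J - I` characters. The sketch below checks that, using the illustrative names `start`, `stop`, and `piece`.

```python
last_name = 'Fredner'
start = 0
stop = 4

piece = last_name[start:stop]
print(piece)                       # Fred
print(len(piece) == stop - start)  # True: the slice has stop - start characters
print(last_name[stop])             # n: index 4 itself is left out of the slice
print(last_name[4:100])            # ner: a stop past the end is quietly clipped
```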

You will frequently see code written like this:

In [77]:
last_name[:4]

'Fred'

In [78]:
last_name[4:]

'ner'

Leaving the left side of the colon blank means "start at index 0." Leaving the right side blank means "continue through the final character." So:

In [79]:
last_name[:]

'Fredner'
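The blank-side shorthand can be checked against the fully written-out slices. This short sketch compares them with `==`, which asks whether two values are equal and has not been introduced above.

```python
last_name = 'Fredner'

print(last_name[:4] == last_name[0:4])               # True: blank left side means 0
print(last_name[4:] == last_name[4:len(last_name)])  # True: blank right side means the end
print(last_name[:4] + last_name[4:] == last_name)    # True: the two halves rebuild the string
```

The last line shows why the half-open rule is convenient: splitting at any index and adding the pieces back together never loses or repeats a character.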

We can also index using negative numbers. Typically, this is most useful to get the last element in a string or other object:

In [75]:
'the last letter of this string is z'[-1]

'z'
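Negative indices count backward from the end of the string, and they can be used in slices too. A brief sketch with `last_name`:

```python
last_name = 'Fredner'

print(last_name[-1])                  # r: the last character
print(last_name[len(last_name) - 1])  # r: the same character, the long way
print(last_name[-3:])                 # ner: the last three characters
print(last_name[:-3])                 # Fred: everything except the last three
```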

# Finding substrings
We frequently need to find specific sequences of characters in a string. For instance, let's say we wanted to find the word "be" in Hamlet's famous speech:

In [80]:
ham = """
To be, or not to be, that is the question,
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them?
"""

In [81]:
ham

"\nTo be, or not to be, that is the question,\nWhether 'tis nobler in the mind to suffer\nThe slings and arrows of outrageous fortune,\nOr to take arms against a sea of troubles,\nAnd by opposing end them?\n"

We can use the `find` method to search for the string we're interested in:

In [83]:
ham.find('be')

4

`find` gives us the `index` where the string we pass it, `'be'`, first occurs. So:

In [84]:
ham[4:6]

'be'

Of course, the speech famously has multiple instances of `be`. We can find the second one like so:

In [89]:
ham.find('be',5)

18

What is going on here? `find` takes two arguments, `'be'` and `5`. `5` represents the index at which `find` begins looking for the next instance of `be`. It finds it at `18`, which means that:

In [92]:
ham[18:20]

'be'

Is there another `be` in this part of Hamlet's speech?

In [93]:
ham.find('be',19)

-1

`find` returns `-1` when it does not find another instance.
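Repeating `find` by hand gets tedious, so here is a sketch of how the same steps could be automated. It uses a `while` loop and a list, which we have not covered yet, and the names `ham_line`, `positions`, and `start` are made up for the example. To keep it short, it searches only the first line of the speech.

```python
ham_line = "\nTo be, or not to be, that is the question,\n"

positions = []  # every index where 'be' starts
start = 0       # where the next search begins

while True:
    index = ham_line.find('be', start)
    if index == -1:   # find returns -1 when nothing more is found
        break
    positions.append(index)
    start = index + 1  # resume just past the match we found

print(positions)  # [4, 18]
```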

# Measuring strings
We frequently need to know how long strings are. We can find out with the `len` function:

In [94]:
len(ham)

200

`ham` is 200 characters long. We can repeat it using `*`:

In [95]:
two_ham = ham * 2

In [99]:
'z'*10 # the same principle

'zzzzzzzzzz'

In [96]:
len(two_ham)

400
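The relationship between `*` and `len` holds in general: repeating a string `n` times multiplies its length by `n`. A small sketch, with a made-up variable `word`, also shows that `len` counts spaces and newlines like any other character.

```python
word = 'ham'
repeated = word * 3

print(repeated)                        # hamhamham
print(len(repeated) == len(word) * 3)  # True
print(len(''))                         # 0: the empty string has no characters
print(len('a b\n'))                    # 4: the space and the newline each count
```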

# Cleaning strings
We frequently need to reformat strings in order to accurately characterize them. For example, when we count words, we would intuitively understand that `The` and `the` are two instances of the same word. But computers cannot: they would count those as one instance of `The` and one of `the`. To make computer counting reflect human counting, we often make strings uniformly lowercase.

In [113]:
nasx = 'I\'m gonna take my horse to the Old Town Road. I\'m gonna ride til I can\'t no more. '

Note that we need to use the escape character `\` to allow us to use `'` to abbreviate words like "I'm" and "can't." Otherwise, Python would think we were closing the string.
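An alternative to escaping is to wrap the string in `"` marks, since a `'` inside double quotes does not close the string. The sketch below compares the two spellings; the variable names are illustrative.

```python
escaped = 'I\'m gonna ride til I can\'t no more.'
unescaped = "I'm gonna ride til I can't no more."

print(escaped == unescaped)  # True: both spellings produce the same string
print(unescaped)
```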

In [114]:
nasx.lower()

"i'm gonna take my horse to the old town road. i'm gonna ride til i can't no more. "

In [115]:
nasx.upper()

"I'M GONNA TAKE MY HORSE TO THE OLD TOWN ROAD. I'M GONNA RIDE TIL I CAN'T NO MORE. "

Strings like Hamlet's speech often contain whitespace characters like `\n` that we don't care about when counting words. The `replace` method, introduced next, lets us remove them.

# Replacing substrings
We will also need to replace portions of strings:

In [116]:
nasx.replace('horse', 'Maserati')

"I'm gonna take my Maserati to the Old Town Road. I'm gonna ride til I can't no more. "

If we want to make multiple changes at one time, we can call the methods in sequence:

In [117]:
nasx.replace('horse','Maserati').replace('ride','drive')

"I'm gonna take my Maserati to the Old Town Road. I'm gonna drive til I can't no more. "

Whitespace characters like `\n` provide a real-world example of how this can be useful for text mining:

In [120]:
ham

"\nTo be, or not to be, that is the question,\nWhether 'tis nobler in the mind to suffer\nThe slings and arrows of outrageous fortune,\nOr to take arms against a sea of troubles,\nAnd by opposing end them?\n"

In [124]:
ham.replace('\n',' ')

" To be, or not to be, that is the question, Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? "

In the above example I chose to `replace` `\n` with a space. Of course, that leaves spaces at the beginning and the end. We can get rid of those with `strip`:

In [125]:
ham.replace('\n',' ').strip()

"To be, or not to be, that is the question, Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them?"

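Note that `strip` only trims the two ends of a string. If a text has runs of several spaces or blank lines in the middle, one common idiom is to `split` the string into words and `join` them back together with single spaces. This is a sketch using a made-up string `messy`; `split` returns a list, which we have not covered yet.

```python
messy = "\nTo be,   or not\n\nto be\n"

replaced = messy.replace('\n', ' ').strip()
print(repr(replaced))    # 'To be,   or not  to be': the inner runs of spaces remain

normalized = ' '.join(messy.split())
print(repr(normalized))  # 'To be, or not to be': every run of whitespace becomes one space
```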
# Counting substrings
Counting words is one of our core techniques. The American author Gertrude Stein has a famous line about roses and language:

In [128]:
stein = 'Rose is a rose is a rose is a rose.'

We can combine some of our techniques and add new ones to accurately count the number of roses in the line:

In [129]:
stein = stein.lower() # lowercasing to count 'Rose' as 'rose'

In [130]:
stein

'rose is a rose is a rose is a rose.'
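The reassignment `stein = stein.lower()` matters because string methods such as `lower` and `replace` return a new string and leave the original untouched. A short sketch with a made-up variable `title`:

```python
title = 'Old Town Road'

lowered = title.lower()
print(lowered)  # old town road
print(title)    # Old Town Road: the original string has not changed

title = title.lower()  # reassign the name to keep the lowercase version
print(title)    # old town road
```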

In [131]:
stein.count('rose')

4
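One caveat: `count` looks for sequences of characters, not whole words, so it also matches inside longer words. The sketch below uses a made-up sentence to show the difference, and then counts whole words by splitting the sentence into a list first (lists also have a `count` method).

```python
line = 'the rose garden is there for the roses'

print(line.count('rose'))  # 2: 'rose', plus the 'rose' inside 'roses'
print(line.count('the'))   # 3: 'the' twice, plus the 'the' inside 'there'

words = line.split()       # break the sentence apart at whitespace
print(words.count('rose')) # 1: only the exact word 'rose'
print(words.count('the'))  # 2: only the exact word 'the'
```

In the Stein line the two approaches agree on four roses. Note, though, that the final `rose.` carries a period, so splitting into words would count only three unless the punctuation were removed first.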

We can also use this as a way to get very rough statistics about a text. Let's figure out how many sentences there are in the first chapter of [*Huckleberry Finn*](https://www.gutenberg.org/files/76/76-h/76-h.htm#c1).

You need to download the chapter, a file called `huck.txt`, from Canvas.

In [None]:
filepath = '/Users/e/Downloads/huck.txt'

In [166]:
huck = open(filepath)

In [149]:
huck

<_io.TextIOWrapper name='huck.txt' mode='r' encoding='UTF-8'>

After using Python to `open` a file, we then need to `read` it in order to access the text.

In [150]:
huck = huck.read()
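You will often see files opened with a `with` block instead, which closes the file automatically once the text has been read. The sketch below is an alternative to the two cells above, not a replacement for them. So that it runs on any computer, it first writes a small stand-in file to a temporary folder; `sample.txt` and all of the variable names are made up for the example.

```python
import os
import tempfile

# Create a small stand-in file so the sketch runs anywhere.
folder = tempfile.mkdtemp()
sample_path = os.path.join(folder, 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as outfile:
    outfile.write('CHAPTER I.\nA stand-in line of text.')

# Open the file, read its text, and close it when the block ends.
with open(sample_path, encoding='utf-8') as infile:
    sample_text = infile.read()

print(sample_text[:10])  # CHAPTER I.
print(infile.closed)     # True: the file was closed for us
```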

In [164]:
huck[:50] # to see the first fifty characters of the chapter

'CHAPTER I.\nYOU don’t know about me without you hav'

In [152]:
len(huck) # remember, len() provides the number of characters

7073

In [153]:
periods = huck.count('.') # rough measure of sentences

In [154]:
periods

66

In [155]:
huck.count('?')

1

In [156]:
huck.count('!')

5

In [157]:
huck.count(',')

92

In [158]:
huck.count(';')

18

In [159]:
spaces = huck.strip().count(' ') # counting spaces provides a very rough measure of words

In [160]:
spaces / periods # estimates average words per sentence

22.015151515151516
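As a sketch of one possible refinement, we could count every kind of sentence-ending punctuation and count words by splitting on whitespace instead of counting spaces. The sentence in `sample` is made up so that the arithmetic is easy to check by hand. This is still a rough estimate: abbreviations like "Mr." would be counted as sentence endings.

```python
sample = 'You do not know me. That is no matter! Do you mind? I do not.'

# Count sentence endings of all three kinds.
enders = sample.count('.') + sample.count('!') + sample.count('?')

# split() breaks the text apart at whitespace; len() counts the pieces.
words = len(sample.split())

print(enders)          # 4
print(words)           # 15
print(words / enders)  # 3.75 words per sentence, on average
```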

We will improve on the accuracy of all of these measures over time.

# Summary
We learned how Python represents integers, strings, and comments; how indexing works; how to assign variables; and some basic techniques for manipulating and measuring strings.