## Getting Started in Jupyter

To run code in a selected cell, click '▶' above or press **`shift+return`**.

Run the next few cells, which demonstrate the basics of Python syntax. Everything to the right of **`'#'`** is a comment.

In [None]:
print "Hello Jupyter!"
print "Hello"+" Jupyter!"

In [None]:
# Integer Arithmetic

int1=10000
int2=150

print int1+int2 # add
print int1-int2 # subtract
print int1*int2 # multiply
print int1/int2 # divide
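
In Python 2, which this notebook uses, `/` between two integers discards the remainder, so the last line above prints `66` rather than `66.67`. The sketch below reuses the values from the cell above and shows a few ways to be explicit about which kind of division you want. It sticks to syntax that behaves the same in Python 2 and 3.

```python
# Reusing the values from the cell above.
int1 = 10000
int2 = 150

quotient = int1 // int2            # floor division: whole-number part only
remainder = int1 % int2            # modulo: what is left over
true_ratio = int1 / float(int2)    # converting one operand to float keeps the fraction

print(quotient)     # 66
print(remainder)    # 100
print(true_ratio)
```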

In [None]:
# Floating Point Arithmetic

float1=10000.0
float2=150.5

print float1+float2 # add
print float1-float2 # subtract
print float1*float2 # multiply
print float1/float2 # divide

In [None]:
# String Manipulation
## In which we split and reassemble a sentence.

sentence="A green hunting cap squeezed the top of a fleshy balloon of a head."
print sentence

In [None]:
words=sentence.split(" ")
print words

In [None]:
rejoined=" ".join(words)
print rejoined

In [None]:
# List Slice Notation

print len(words)

print words[4]
print words[2:5]
print words[2:]
print words[:2]
print words[-4]
print words[-4:]
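
In slice notation, indices count from zero, the start index is included, the end index is excluded, and negative indices count back from the end of the list. A short sketch using the same sentence, with a few equivalences that may make the notation easier to remember:

```python
sentence = "A green hunting cap squeezed the top of a fleshy balloon of a head."
words = sentence.split(" ")

# words[a:b] includes index a but stops just before index b.
middle = words[2:5]

# A negative index counts back from the end, so these two slices match.
last_four = words[-4:]
same_last_four = words[len(words) - 4:]

# Joining the two halves of a split at any index rebuilds the original list.
rebuilt = words[:2] + words[2:]

print(middle)
print(last_four)
print(rebuilt == words)
```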

#### ▷ Lines beginning with **`'!'`** are executed in Bash, i.e., the default Unix environment you're in when you open a new Terminal window.

In [None]:
# The following Bash command displays your username.

!whoami

In [None]:
# And this one prints a list of files on your desktop.

!ls /Users/yourname/Desktop/   ### Swap in your username here. ###

In [None]:
# The following simplified format does the same.

# Note that this shortcut works in Bash but not in Python's 'os' module.

!ls ~/Desktop/

#### ▷  Python can carry out many of the same file-system tasks as well, via the `os` module:

In [None]:
# This cell imports the 'os' module, then prints a list of filenames on the desktop.

import os

os.chdir("/Users/yourname/Desktop/") ### Swap in your username here. ###
filenames = os.listdir("/Users/yourname/Desktop/")  ### Swap in your username here. ###
print filenames
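
If you would rather not hard-code your username, the standard library's `os.path.expanduser` expands `~` much as Bash does. A minimal sketch, assuming your desktop folder is named `Desktop` and sits in your home directory (the `isdir` check keeps the code from failing if it does not):

```python
import os

# Expand "~" to the current user's home directory.
desktop_path = os.path.expanduser("~/Desktop")
print(desktop_path)

# List the folder's contents only if it actually exists on this machine.
if os.path.isdir(desktop_path):
    desktop_files = os.listdir(desktop_path)
else:
    desktop_files = []

print(len(desktop_files))
```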

In [None]:
# The pprint module shows us the list above in a more readable format.

from pprint import pprint

pprint(filenames)

In [None]:
# If your desktop is filled with screen capture files, uncomment and run the lines below 
# to create a 'Screenshots' directory and send future captures there by default.

# !mkdir ~/Desktop/Screenshots
# !defaults write com.apple.screencapture location ~/Desktop/Screenshots ; killall SystemUIServer

### Basic Text I/O
#### ▷ In the following demonstrations we'll be loading plain text files from the Web like so:


In [None]:
import urllib2

url="http://principalhand.org/workshop-data/Melville_Moby-Dick.txt"

melville_string=urllib2.urlopen(url).read()

#### ▷ And here's how to work with text files on your local system:

In [None]:
# First, this Bash command creates a two-line text file on your desktop.

!echo "This is the first line.\nThis is the second line." > ~/Desktop/test_file.txt

In [None]:
# The shortest format for creating a string from a text file:

text=open("/Users/yourname/Desktop/test_file.txt").read()  ### Swap in your username here. ###

print text

In [None]:
# Or we can work with one line at a time. Note that the newline character at the end of the
# first line ends up creating a gap when the lines are printed separately.

with open("/Users/yourname/Desktop/test_file.txt") as fi:  ### Swap in your username here. ###
    for line in fi:
        print line

In [None]:
# Load a text file as a list of lines, discarding newline characters.

line_list=open("/Users/yourname/Desktop/test_file.txt").read().splitlines()  ### Swap in your username here. ###

print line_list
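
The difference between these approaches comes down to what happens to the newline characters. The sketch below works on an in-memory string with the same contents as `test_file.txt`, so it runs without touching your desktop:

```python
text = "This is the first line.\nThis is the second line."

# splitlines() drops the newline characters...
without_newlines = text.splitlines()

# ...unless asked to keep them, which is what iterating over a file gives you.
with_newlines = text.splitlines(True)

# Stripping the trailing newline from each line removes the gap when printing.
for line in with_newlines:
    print(line.rstrip("\n"))

print(without_newlines)
```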

In [None]:
# And we can write string data to a new text file like so:

fo=open("/Users/yourname/Desktop/test_file_2.txt","w")  ### Swap in your username here. ###

fo.write("This is another first line.\n")
fo.write("This is another second line.")

fo.close()

# A file called "test_file_2.txt" should appear on your desktop.
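
A `with` block closes the file automatically, even if an error occurs partway through, so it is a common alternative to calling `close()` by hand. A small sketch, which writes to a temporary folder rather than the desktop so that it runs on any system (the filename `test_file_3.txt` is just an illustration):

```python
import os
import tempfile

# Build a path inside a fresh temporary directory.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "test_file_3.txt")

# The file is closed automatically when the 'with' block ends.
with open(path, "w") as fo:
    fo.write("This is another first line.\n")
    fo.write("This is another second line.")

# Read the file back to confirm what was written.
with open(path) as fi:
    lines_read = fi.read().splitlines()

print(lines_read)
```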

## Streamlined Text Processing with TextBlob

Run the following to install TextBlob and download a set of sample corpora for the current user. To install the module for all users, delete the `--user` option and run the commands in the terminal.

In [None]:
!pip install -U --user textblob
!python -m textblob.download_corpora

In [None]:
# Let's relaunch Python so we can access our newly installed modules.

quit()

### ▷ Creating a TextBlob object

In [None]:
from textblob import TextBlob
from pprint import pprint

paragraph='''"There, there, I shall find some employment, although it will not necessarily be what you would call a good job. I may have some valuable insights which may benefit my employer. Perhaps the experience can give my writing a new dimension. Being actively engaged in the system which I criticize will be an interesting irony in itself." Ignatius belched loudly. "If only Myrna Minkoff could see how low I've fallen."'''

blob_1 = TextBlob(paragraph)

print blob_1

In [None]:
# 'blob.words' is a list of words.

print blob_1.words[:25]

In [None]:
# 'blob.sentences' is a list of Sentence objects.

pprint(blob_1.sentences)

### ▷ POS tagging

You can find a list of POS tags here: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
# 'blob_1.tags' is a list of NLTK's best guess for each word's part of speech (POS).
# The following prints the first 20 word-tag pairs in our text.

from pprint import pprint

pprint(blob_1.tags[:20])

In [None]:
print blob_1.noun_phrases

In [None]:
# Parse a sentence's grammar in tree form.

blob_1.parse()

### ▷ Now let's work with a longer text.

In [None]:
# Downloading Melville's _Moby Dick_

import urllib2

url="http://principalhand.org/workshop-data/Melville_Moby-Dick.txt"

melville_string=urllib2.urlopen(url).read()

In [None]:
# Create a TextBlob object and print a random sentence.

from textblob import TextBlob
import random

melville_blob = TextBlob(melville_string)
print random.sample(melville_blob.sentences,1)

In [None]:
# Return the number of times a given word appears in a text.

print melville_blob.words.count('the')

In [None]:
# View the most frequently occurring words in a text. Note that this approach is 
# case-sensitive.

from collections import Counter

print Counter(melville_blob.words).most_common(25)

In [None]:
# Here's a non-case-sensitive version of the command above, which works by converting the
# full text to lowercase before calculating string frequencies.

print Counter(melville_blob.words.lower()).most_common(25)
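
To see why case matters, here is the same idea on a single sentence, using only the standard library. Splitting on whitespace is a cruder tokenizer than TextBlob's, so treat this as an illustration of `Counter` rather than a substitute for `blob.words`:

```python
from collections import Counter

sample = "The whale surfaced and the crew watched the whale"
tokens = sample.split()

# Case-sensitive: "The" and "the" are tallied separately.
case_sensitive = Counter(tokens)

# Lowercasing first merges them into one entry.
case_insensitive = Counter(token.lower() for token in tokens)

print(case_sensitive["the"])
print(case_insensitive["the"])
print(case_insensitive.most_common(2))
```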

## Removing Stop Words

Now let's view the most frequent words in our corpus with stopwords removed.

The first time you run the cell below, uncomment the second line to download all nltk corpora and packages.


In [None]:
import nltk

#!python -m nltk.downloader all

In [None]:
# Loading stop word list

from operator import itemgetter
from nltk.corpus import stopwords
from textblob import TextBlob
from textblob import Word

stopwords_eng=stopwords.words('english')+["'s"] ## Adding "'s"  as a stop word
print sorted(stopwords_eng)

In [None]:
# Creates a new Moby Dick TextBlob object (just for convenience)

import urllib2
url="http://principalhand.org/workshop-data/Melville_Moby-Dick.txt"
melville_string=urllib2.urlopen(url).read()


# Create a TextBlob object and print a random sentence.
from textblob import TextBlob
import random

melville_blob = TextBlob(melville_string)
print random.sample(melville_blob.sentences,1)

In [None]:
# Creates a copy of our word tally list with stopwords removed.
from collections import Counter
from textblob import Word
from pprint import pprint

most_freq=Counter(melville_blob.words.lower()).most_common()

most_freq_ns=[]

for pair in most_freq:
    word=pair[0].lower()
    pre_apostrophe=Word(word).split("'")[0] # the part of the word before any apostrophe
    if not (word in stopwords_eng or pre_apostrophe in stopwords_eng):
        most_freq_ns.append(pair)

        
print len(most_freq_ns)
pprint(most_freq_ns[:25])

In [None]:
# Creating a function that applies the process above to any TextBlob object.

def most_freq_no_stop(blob):
    stopwords_eng=stopwords.words('english')+["'s"]
    most_freq=Counter(blob.words.lower()).most_common()
    
    most_freq_no_stop=[]

    for pair in most_freq:
        word=pair[0].lower()
        pre_apostrophe=Word(word).split("'")[0] # the part of the word before any apostrophe
        if not (word in stopwords_eng or pre_apostrophe in stopwords_eng):
            most_freq_no_stop.append(pair)
    
    return most_freq_no_stop

In [None]:
pprint(most_freq_no_stop(melville_blob)[:25])
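
The filter is meant to check each word twice: once as-is and once with anything after an apostrophe removed, so that a token like `it's` is judged by `it`. A stripped-down sketch of that logic on toy data; the three-item stop list and the word counts are made up for illustration and stand in for NLTK's list and the real tallies:

```python
# Hypothetical stand-ins for the NLTK stop word list and a Counter tally.
toy_stopwords = ["the", "it", "'s"]
toy_counts = [("the", 50), ("whale", 12), ("it's", 9), ("ahab's", 4), ("'s", 3)]

filtered = []
for word, count in toy_counts:
    # Keep only the part of the word before the first apostrophe.
    pre_apostrophe = word.split("'")[0]
    if not (word in toy_stopwords or pre_apostrophe in toy_stopwords):
        filtered.append((word, count))

print(filtered)
```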

#### ▷ Let's load another text for comparison.

In [None]:
import urllib2

url="http://principalhand.org/workshop-data/Austen_Persuasion.txt"

austen_string=urllib2.urlopen(url).read()

In [None]:
#### This cell will throw an error. Don't panic! ####

austen_blob = TextBlob(austen_string)
pprint(most_freq_no_stop(austen_blob)[:30])

#### ▷ The cell above will produce a `UnicodeDecodeError`. To fix the problem, we can apply the `decode()` method to our string before passing it to the TextBlob constructor.


In [None]:
url="http://principalhand.org/workshop-data/Austen_Persuasion.txt"

austen_string=urllib2.urlopen(url).read().decode("utf8")

austen_blob = TextBlob(austen_string)

pprint(most_freq_no_stop(austen_blob)[:25])
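
In Python 2, `read()` returns a string of raw bytes, and any non-ASCII character (an accented letter or a curly quote, say) occupies more than one byte in UTF-8. Decoding turns those bytes into characters. A small sketch with a made-up byte string:

```python
# The UTF-8 bytes for "Caf" followed by an accented "e".
raw_bytes = b"Caf\xc3\xa9"

# decode() interprets the bytes as UTF-8 and returns a text (unicode) string.
decoded = raw_bytes.decode("utf8")

print(len(raw_bytes))   # five bytes...
print(len(decoded))     # ...but four characters
```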

### ▷ Yet another word frequency list

In [None]:
import urllib2

url="http://principalhand.org/workshop-data/Gilman_Yellow-Wallpaper.txt"

gilman_string=urllib2.urlopen(url).read().decode("utf8")

gilman_blob = TextBlob(gilman_string)

pprint(most_freq_no_stop(gilman_blob)[:25])

### ▷ Sanitizing Input

Variations in text encoding formats constantly cause trouble, so we'll define a function to clean up a given string.

First we'll install the unidecode module, which converts non-ASCII characters like curly quotes to their closest ASCII equivalents.

In [None]:
# The sanitize function defined below also imports bleach, so we install it here too.
!pip install --user -U unidecode bleach

In [None]:
# Let's relaunch Python so we can access our newly installed module.

quit()

In [None]:
from unidecode import unidecode
import bleach

def sanitize(text):
    x=unidecode(bleach.clean(text)) ## Applying two tools together to create proper ASCII text.
    x=x.replace("\n"," ").replace("\r"," ").strip() ## Replacing line breaks with spaces and stripping leading and trailing whitespace
    while "  " in x:
        x=x.replace("  "," ")      ## Replacing all sequences of spaces with a single space
    return x


In [None]:
# Let's make sure our sanitizing function works.

print sanitize(u"Gz\n    7📎2“N\u0303o”I   JÉX🐛vp")
print sanitize("5KzMs  BCzsbR   uHJINE8")

# If this shows an error on the first try, run it again and see if it behaves.
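
The whitespace step can also be written without a loop: calling `split()` with no arguments splits on any run of whitespace (spaces, tabs, line breaks), and joining the pieces with single spaces collapses each run. A sketch of that alternative; `collapse_whitespace` is a name chosen here for illustration, and it covers only the whitespace part of `sanitize`, not the `unidecode` or `bleach` steps:

```python
def collapse_whitespace(text):
    # split() with no arguments discards leading and trailing whitespace
    # and treats any run of whitespace characters as one separator.
    return " ".join(text.split())

print(collapse_whitespace("5KzMs  BCzsbR   uHJINE8"))
print(collapse_whitespace("  first line\r\n   second line\n"))
```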

### ▷ Creating a concordance with NLTK

In [None]:
import nltk
import urllib2

url="http://principalhand.org/workshop-data/Stein_Three-Lives.txt"
temp_string=urllib2.urlopen(url).read().decode('utf8')

raw=sanitize(temp_string)


tokens = nltk.word_tokenize(raw)
nltk_text = nltk.Text(tokens)


# concordance() prints its results directly and returns None, so no 'print' is needed.
nltk_text.concordance('blood')


### ▷ Simple sentiment analysis with TextBlob

In [None]:
# Negative polarity example
from textblob import TextBlob

text="This is a very mean and nasty sentence."

blob = TextBlob(sanitize(text))

# result between -1 and +1
sentiment_score=blob.sentiment.polarity  # <--

print sentiment_score

In [None]:
# Positive polarity example

text="This is a nice and positive sentence."

blob = TextBlob(sanitize(text))

# result between -1 and +1
sentiment_score=blob.sentiment.polarity  # <--

print sentiment_score

In [None]:
# High subjectivity example

text="This is a very mean and nasty sentence."

blob = TextBlob(sanitize(text))

# result between 0 and +1
sentiment_score=blob.sentiment.subjectivity  # <--

print sentiment_score

In [None]:
# Low subjectivity example

text="This sentence states a fact."

blob = TextBlob(sanitize(text))

# result between 0 and +1
sentiment_score=blob.sentiment.subjectivity  # <--

print sentiment_score

### ▷ Plotting Sentiment Values

In [None]:
# Let's map sentiment ratings across the course of a full book.

# First, install matplotlib, numpy, and pandas packages.

!pip install --user -U matplotlib numpy pandas

In [None]:
# Let's relaunch Python so we can access our newly installed modules.

quit()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pprint import pprint

# Viewing available plot styles and selecting one to use.

pprint(plt.style.available)

plt.style.use('ggplot')

In [None]:
# Creates a new Moby Dick TextBlob object (just for convenience)

import urllib2
url="http://principalhand.org/workshop-data/Melville_Moby-Dick.txt"
melville_string=urllib2.urlopen(url).read()


# Create a TextBlob object and print a random sentence.
from textblob import TextBlob
import random

melville_blob = TextBlob(melville_string)
print random.sample(melville_blob.sentences,1)

In [None]:
melville_sentiments=[sentence.sentiment.polarity for sentence in melville_blob.sentences]
print melville_sentiments[:10]

In [None]:
plt.figure(figsize=(18,8))
plt.plot(melville_sentiments)

In [None]:
# Smoothing our data before plotting

melville_sentiments_pd=pd.Series(melville_sentiments)
melville_sentiments_smooth=melville_sentiments_pd.rolling(window=200).mean()

print melville_sentiments_smooth[195:210]
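
A rolling mean replaces each value with the average of the most recent `window` values. The first `window - 1` positions have too few values behind them, which is why the output above starts with `NaN` and the first real number appears at index 199. A pure-Python sketch of the same calculation on a short list (`rolling_mean` is a hypothetical helper, written only to illustrate what the pandas call computes):

```python
def rolling_mean(values, window):
    smoothed = []
    for i in range(len(values)):
        if i < window - 1:
            # Not enough earlier values to fill a window yet.
            smoothed.append(float("nan"))
        else:
            chunk = values[i - window + 1:i + 1]
            smoothed.append(sum(chunk) / float(window))
    return smoothed

print(rolling_mean([0.0, 0.5, 1.0, -0.5, 0.0], 2))
```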

In [None]:
plt.figure(figsize=(18,8))
plt.plot(melville_sentiments_smooth)

In [None]:
max_sentiment=max(melville_sentiments_smooth[199:])

print max_sentiment # max sentiment polarity value

max_sent_index=list(melville_sentiments_smooth).index(max_sentiment) # index position of the 'max_sentiment' value

print melville_blob.sentences[max_sent_index]

In [None]:
min_sentiment=min(melville_sentiments_smooth[199:])

print min_sentiment # min sentiment polarity value

min_sent_index=list(melville_sentiments_smooth).index(min_sentiment) # index position of the 'min_sentiment' value

print melville_blob.sentences[min_sent_index]
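
Because each smoothed value summarizes a window of sentences, the index found above points at the last sentence in that window, not at a single most positive or most negative sentence. To read the whole passage, slice back from that index. A sketch with toy numbers (`window_bounds` is a hypothetical helper, and the values are invented):

```python
def window_bounds(index, window):
    # A rolling mean at position `index` covers the `window` items ending there.
    start = index - window + 1
    end = index + 1          # slice end is exclusive
    return start, end

toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_smoothed = [float("nan"), float("nan"), 0.1, 0.4, 0.2, -0.3]

# Skip the leading NaN values, as the cells above do with [199:].
window = 3
peak = max(toy_smoothed[window - 1:])
peak_index = toy_smoothed.index(peak)

start, end = window_bounds(peak_index, window)
print(toy_sentences[start:end])
```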

In [None]:
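# Relaunching Python cleared our earlier variables, so we rebuild the Persuasion TextBlob first.
url="http://principalhand.org/workshop-data/Austen_Persuasion.txt"
austen_blob=TextBlob(urllib2.urlopen(url).read().decode("utf8"))

austen_sentiments=[sentence.sentiment.polarity for sentence in austen_blob.sentences]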
#print austen_sentiments[:10]
austen_sentiments_pd=pd.Series(austen_sentiments)
austen_sentiments_smooth=austen_sentiments_pd.rolling(window=200).mean()
#print austen_sentiments_smooth[190:210]

plt.figure(figsize=(18,8))
plt.plot(austen_sentiments_smooth)

In [None]:
max_sentiment=max(austen_sentiments_smooth[199:])
print max_sentiment # max sentiment polarity value

max_sent_index=list(austen_sentiments_smooth).index(max_sentiment) # index position of the 'max_sentiment' value
print austen_blob.sentences[max_sent_index]

In [None]:
min_sentiment=min(austen_sentiments_smooth[199:])
print min_sentiment # min sentiment polarity value

min_sent_index=list(austen_sentiments_smooth).index(min_sentiment) # index position of the 'min_sentiment' value
print austen_blob.sentences[min_sent_index]

In [None]:
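# Creating functions to expedite the process we put together above.
# These accept an optional second argument for the smoothing level. The default window is 200 sentences.
# They rely on the sanitize() function defined earlier; rerun that cell first if you have relaunched Python since.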

def plot_polarity(text_in,window=200):
    blob = TextBlob(sanitize(text_in))
    sentiments=[sentence.sentiment.polarity for sentence in blob.sentences]
    sentiments_pd=pd.Series(sentiments)
    sentiments_smooth=sentiments_pd.rolling(window).mean()
    plt.figure(figsize=(18,8))
    plt.plot(sentiments_smooth)

def plot_subjectivity(text_in,window=200):
    blob = TextBlob(sanitize(text_in))
    sentiments=[sentence.sentiment.subjectivity for sentence in blob.sentences]
    sentiments_pd=pd.Series(sentiments)
    sentiments_smooth=sentiments_pd.rolling(window).mean()
    plt.figure(figsize=(18,8))
    plt.plot(sentiments_smooth)



In [None]:
# Persuasion Subjectivity

import urllib2
url="http://principalhand.org/workshop-data/Austen_Persuasion.txt"
temp_string=urllib2.urlopen(url).read()
temp_string=temp_string.replace("\r"," ").replace("\n"," ").replace("  "," ")


#plot_polarity(temp_string)
plot_subjectivity(temp_string)

In [None]:
plot_subjectivity(temp_string,10)

In [None]:
# Pride and Prejudice Subjectivity

import urllib2
url="http://www.gutenberg.org/cache/epub/1342/pg1342.txt"
temp_string=urllib2.urlopen(url).read()
temp_string=temp_string.replace("\r"," ").replace("\n"," ").replace("  "," ")

#plot_polarity(temp_string)
plot_subjectivity(temp_string)


In [None]:
# Emma Subjectivity

import urllib2
url="http://www.gutenberg.org/cache/epub/158/pg158.txt"
temp_string=urllib2.urlopen(url).read()
temp_string=temp_string.replace("\r"," ").replace("\n"," ").replace("  "," ")


#plot_polarity(temp_string)
plot_subjectivity(temp_string)

In [None]:
# Sense and Sensibility Subjectivity

import urllib2
url="http://www.gutenberg.org/cache/epub/161/pg161.txt"
temp_string=urllib2.urlopen(url).read()
temp_string=temp_string.replace("\r"," ").replace("\n"," ").replace("  "," ")


#plot_polarity(temp_string)
plot_subjectivity(temp_string)

In [None]:
# Subjectivity: New York Times Current History; The European War, Vol 2, No. 3, June, 1915

import urllib2
url="http://www.gutenberg.org/cache/epub/15480/pg15480.txt"
temp_string=urllib2.urlopen(url).read()
temp_string=temp_string.replace("\r"," ").replace("\n"," ").replace("  "," ")

#plot_polarity(temp_string)
plot_subjectivity(temp_string)

In [None]:
# Huckleberry Finn Polarity

import urllib2
url="https://www.gutenberg.org/files/76/76-0.txt"
temp_string=urllib2.urlopen(url).read()
temp_string=temp_string.replace("\r"," ").replace("\n"," ").replace("  "," ")


plot_polarity(temp_string)
#plot_subjectivity(temp_string)

### ▷ Plotting smoothed random data (for comparison)

In [None]:
## Plotting completely random data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

random_vals=np.random.rand(4000)

vals_pd=pd.Series(random_vals)
vals_smooth=vals_pd.rolling(window=200).mean()

plt.figure(figsize=(18,8))
plt.plot(vals_smooth)


### ▷ Sentiment Histograms

In [None]:
def hist_polarity(text_in):
    blob = TextBlob(sanitize(text_in))
    sentiments=[sentence.sentiment.polarity for sentence in blob.sentences]
    plt.figure(figsize=(20,10))
    plt.hist(sentiments)

def hist_subjectivity(text_in):
    blob = TextBlob(sanitize(text_in))
    sentiments=[sentence.sentiment.subjectivity for sentence in blob.sentences]
    plt.figure(figsize=(20,10))
    plt.hist(sentiments)

In [None]:
import urllib2
url="http://principalhand.org/workshop-data/Austen_Persuasion.txt"
temp_string=urllib2.urlopen(url).read()

In [None]:
hist_subjectivity(temp_string)

In [None]:
# These functions remove zero values before plotting.

def hist_polarity_filtered(text_in):
    blob = TextBlob(text_in.decode("utf8"))
    sentiments=[sentence.sentiment.polarity for sentence in blob.sentences]
    sentiments=[x for x in sentiments if x != 0]
    plt.figure(figsize=(15,8))
    plt.hist(sentiments)

def hist_subjectivity_filtered(text_in):
    blob = TextBlob(text_in.decode("utf8"))
    sentiments=[sentence.sentiment.subjectivity for sentence in blob.sentences]
    sentiments=[x for x in sentiments if x != 0]
    plt.figure(figsize=(15,8))
    plt.hist(sentiments)
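
Sentences in which the analyzer finds no sentiment-bearing words typically score exactly zero, and a large pile of zeros can dominate a histogram. Before filtering, it may be worth checking how large that share is. A sketch on invented scores:

```python
# Invented polarity scores standing in for a real list of sentence sentiments.
toy_sentiments = [0.0, 0.5, 0.0, -0.25, 0.0, 0.0, 0.8, 0.0]

# The same filter used in the functions above.
nonzero = [x for x in toy_sentiments if x != 0]

# Fraction of sentences that the filter removes.
zero_share = 1 - len(nonzero) / float(len(toy_sentiments))

print(nonzero)
print(zero_share)
```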


In [None]:
hist_polarity_filtered(temp_string)

In [None]:
import urllib2
url="http://principalhand.org/workshop-data/Melville_Moby-Dick.txt"
melville_string=urllib2.urlopen(url).read()

In [None]:
hist_polarity_filtered(melville_string)

### ▷ Descriptive Stats

In [None]:
blob=melville_blob
melville_sentiments=[sentence.sentiment.subjectivity for sentence in blob.sentences]
np.mean(melville_sentiments)

In [None]:
blob=austen_blob
austen_sentiments=[sentence.sentiment.subjectivity for sentence in blob.sentences]
np.mean(austen_sentiments)

In [None]:
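# Relaunching Python cleared 'gilman_blob', so we rebuild it here.
url="http://principalhand.org/workshop-data/Gilman_Yellow-Wallpaper.txt"
gilman_blob=TextBlob(urllib2.urlopen(url).read().decode("utf8"))
blob=gilman_blob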
gilman_sentiments=[sentence.sentiment.subjectivity for sentence in blob.sentences]
np.mean(gilman_sentiments)

### ▷ Statistical Tests

In [None]:
# T-test of independent values

# Inappropriate in this case because zeroes in data make distribution non-normal.

from scipy import stats

print stats.ttest_ind(melville_sentiments,austen_sentiments)

print stats.ttest_ind(melville_sentiments,gilman_sentiments)

print stats.ttest_ind(austen_sentiments,gilman_sentiments)

In [None]:
# Mann-Whitney U test

# Designed to work for non-normally distributed data.

from scipy import stats

print stats.mannwhitneyu(melville_sentiments,austen_sentiments)

print stats.mannwhitneyu(melville_sentiments,gilman_sentiments)

print stats.mannwhitneyu(austen_sentiments,gilman_sentiments)
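
The Mann-Whitney test compares the ordering of values rather than their magnitudes: its U statistic counts, over every pairing of one value from each sample, how often the first sample's value is larger (ties count as half). That is why it does not rely on the data being normally distributed. A brute-force sketch of the statistic on toy data; `u_statistic` is a hypothetical helper for illustration, it does not compute a p-value, and SciPy's implementation is far more efficient:

```python
def u_statistic(sample_a, sample_b):
    # Count the pairs in which the value from sample_a exceeds the value
    # from sample_b, scoring each tie as half a win.
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

low = [0.1, 0.2, 0.3]
high = [0.3, 0.6, 0.9]

u_low = u_statistic(low, high)
u_high = u_statistic(high, low)

print(u_low)
print(u_high)
# The two counts always add up to the number of pairs.
print(u_low + u_high == len(low) * len(high))
```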

<a rel="license"
     href="http://creativecommons.org/publicdomain/zero/1.0/">
    <img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0" />
  </a>