We'll often want this magical line at the start of our notebooks.
It makes plots show up right in the notebook. We might as well get used to it.

In [None]:
%matplotlib inline

# The TextBlob Library

Other people have written Python libraries that let us do complicated things with just a line or two of Python.

We are going to start with a library named [TextBlob](https://textblob.readthedocs.io/en/dev/).

In [None]:
from textblob import TextBlob

In [None]:
my_blob = TextBlob("Text mining makes me happy for some reason. I guess it's a good thing that I'm so \
easily entertained :-).")

`my_blob` is an object that can tell us many things about itself.

In [None]:
my_blob.sentences

In [None]:
my_blob.words

In [None]:
my_blob.sentences[0].words

In [None]:
print(my_blob.tags)

Tag Meanings
1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
16.	PDT	Predeterminer
17.	POS	Possessive ending
18.	PRP	Personal pronoun
19.	PRP$	Possessive pronoun
20.	RB	Adverb
21.	RBR	Adverb, comparative
22.	RBS	Adverb, superlative
23.	RP	Particle
24.	SYM	Symbol
25.	TO	to
26.	UH	Interjection
27.	VB	Verb, base form
28.	VBD	Verb, past tense
29.	VBG	Verb, gerund or present participle
30.	VBN	Verb, past participle
31.	VBP	Verb, non-3rd person singular present
32.	VBZ	Verb, 3rd person singular present
33.	WDT	Wh-determiner
34.	WP	Wh-pronoun
35.	WP$	Possessive wh-pronoun
36.	WRB	Wh-adverb
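
Once we know what the tags mean, we can use them to pull out particular kinds of words. Here is a minimal sketch of that idea. To keep it runnable without TextBlob, it uses a hand-written list called `sample_tags` (a made-up stand-in, not real tagger output) shaped like the list of `(word, tag)` pairs that `my_blob.tags` gives us.

```python
# Hypothetical sample of (word, tag) pairs, shaped like the list that
# my_blob.tags returns. These tags were written by hand for illustration.
sample_tags = [
    ("Text", "NN"),
    ("mining", "NN"),
    ("makes", "VBZ"),
    ("me", "PRP"),
    ("happy", "JJ"),
    ("for", "IN"),
    ("some", "DT"),
    ("reason", "NN"),
]

# Keep only the words whose tag starts with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in sample_tags if tag.startswith("NN")]

# Keep only the adjectives (JJ, JJR, JJS).
adjectives = [word for word, tag in sample_tags if tag.startswith("JJ")]

print(nouns)
print(adjectives)
```

With a real blob, we could try swapping `sample_tags` for `my_blob.tags`.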

In [None]:
my_blob.sentiment

* **Polarity** is a float in the range [-1, 1], where 1 means positive sentiment and -1 means negative sentiment.
* **Subjectivity** is a float in the range [0, 1], where 0 is very objective and 1 is very subjective. Subjective sentences generally express personal opinion, emotion, or judgment, whereas objective sentences convey factual information.
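
To get a feel for what a polarity score in [-1, 1] means, here is a toy sketch that averages per-word scores. This is not TextBlob's implementation: the `TOY_LEXICON` dictionary, its values, and the `toy_polarity` function are all made up for illustration.

```python
# Toy word scores, made up for this example. Real sentiment lexicons are far larger.
TOY_LEXICON = {
    "happy": 0.8,
    "good": 0.7,
    "entertained": 0.5,
    "sad": -0.6,
    "terrible": -1.0,
}

def toy_polarity(text):
    """Average the scores of the words we recognize; return 0.0 if none are known."""
    # Split on whitespace, then strip surrounding punctuation and lowercase.
    words = [w.strip(".,!?:;()-").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

print(toy_polarity("Text mining makes me happy."))
print(toy_polarity("That was a terrible, sad day."))
print(toy_polarity("The file has 100 lines."))
```

Because every word score lies in [-1, 1], the average does too. Text with no recognized words comes out neutral at 0.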

# Working with Text Files

These lines read a text file. The first line creates a file object that points to the file. The second line reads in the contents of that file and assigns it to a variable named `genesis_raw`.

In [None]:
myfile = open('corpora/genesis.txt')
genesis_raw = myfile.read()
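
The two lines above never close the file. A common alternative is a `with` block, which closes the file for us when the block ends. The sketch below writes a small throwaway file first so that it runs anywhere. The file name `sample.txt` and its contents are made up, and we are assuming UTF-8 text.

```python
import os
import tempfile

# Create a small throwaway file so this example runs anywhere.
# The file name and its contents are made up for illustration.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("In the beginning\n")

# The with block closes the file automatically, even if read() raises an error.
with open(sample_path, encoding="utf-8") as myfile:
    sample_raw = myfile.read()

print(len(sample_raw))
print(myfile.closed)
```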

`genesis_raw` is now a string containing every character in Genesis.
Let's see how many characters that is:

In [None]:
len(genesis_raw)

We can display the first 100 characters:

In [None]:
genesis_raw[:100]

Let's make it into a TextBlob:

In [None]:
gb = TextBlob(genesis_raw)

Now let's count the words and sentences:

In [None]:
len(gb.words)

In [None]:
len(gb.sentences)

We can make a plot using a super-powerful library called [pandas](https://pandas.pydata.org). I won't explain this all now.

In [None]:
import pandas as pd
gb_df = pd.DataFrame.from_dict(gb.word_counts, orient='index', columns=['count'])
sdf = gb_df.sort_values(by='count', ascending=False)[:30]
sdf.plot.bar(figsize=[10, 5])
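
If the pandas lines feel like a lot, the core idea is just counting words and keeping the most frequent ones. Here is a rough sketch of that idea using only the standard library. The `sample_text` string is a made-up stand-in for `genesis_raw`, and splitting on whitespace is a much cruder way to find words than what TextBlob does.

```python
from collections import Counter

# A short made-up text standing in for genesis_raw.
sample_text = "the earth was without form and the darkness was upon the deep"

# Lowercase and split on whitespace. This is a much cruder tokenizer than TextBlob's.
sample_words = sample_text.lower().split()

# Counter maps each distinct word to the number of times it appears.
sample_counts = Counter(sample_words)

# most_common(n) returns the n most frequent words, highest count first.
for word, count in sample_counts.most_common(3):
    print(word, count)
```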

### Dispersion plot with nltk

Another library that overlaps with TextBlob is [nltk](https://www.nltk.org). It does some things that TextBlob doesn't do, but it is just a tad more complicated. Here we'll use it to make a dispersion plot and a concordance.

In [None]:
from nltk.draw import dispersion_plot
dispersion_plot(gb.words, ["Adam", "Noah"])

In [None]:
from nltk.text import ConcordanceIndex
ci = ConcordanceIndex(gb.words)
ci.print_concordance("Adam", width=80, lines=25)
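
Both tools are built on the same simple idea: find every position where a word occurs. A dispersion plot draws those positions along the text, and a concordance shows each occurrence with a few words of context. Here is a bare-bones sketch of that idea. It is not how nltk implements either one, and the helper names `word_offsets` and `simple_concordance` are our own.

```python
def word_offsets(words, target):
    """Return the positions in words where target appears (exact match)."""
    return [i for i, w in enumerate(words) if w == target]

def simple_concordance(words, target, window=3):
    """Return each occurrence of target with up to `window` words of context per side."""
    lines = []
    for i in word_offsets(words, target):
        left = " ".join(words[max(0, i - window):i])
        right = " ".join(words[i + 1:i + 1 + window])
        lines.append(f"{left} [{target}] {right}".strip())
    return lines

# A short made-up word list standing in for gb.words.
sample_words = "And Adam called his wife Eve and Adam knew his wife".split()

print(word_offsets(sample_words, "Adam"))
for line in simple_concordance(sample_words, "Adam"):
    print(line)
```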

## The course site

It lives on GitHub at https://github.com/bsherin/text_mining_content