# Text Analysis

In this module, we will use the Natural Language Toolkit Library (NLTK) to look at individual words and sentences in a text and clean unneccessary features from the text data to prepare for sentiment analysis. Then using the textblob library, we will analyze the sentiment of opinioned data to give a numerical value for use in a predictive model.

#### Tokenizing Words and Sentences

Recall in the "Python Dictionaries and String Manipulation" notebook, we used the .split() function to break a sentence apart.

The NLTK library was built to separate punctuation from words when tokenizing (splitting into parts).

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#this is sample data
from nltk.corpus import names  

from string import punctuation

#if the next cell does not work
#remove number symbol on following lines and re-run this cell
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('names')
nltk.download('stopwords')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\GBTC406007ur\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\GBTC406007ur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\GBTC406007ur\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\GBTC406007ur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\GBTC406007ur\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [13]:
#[create an empty list here to hold the tokens at the end]
newls = []

with open('datasets_12dancingprincesses.txt', 'r') as f:
    for line in f:
        cline = line.strip() #get rid of newline character

        if cline == '': pass  #this will skip over lines that only had a newline and are now blank
        else:
            tknls = word_tokenize(cline)

        for token in tknls:
        #[write the function to append each token to the empty list you created at the start of this code]
        
            newls.append(token)
        

In [10]:
f

<_io.TextIOWrapper name='datasets_12dancingprincesses.txt' mode='r' encoding='cp1252'>

In [11]:
print(f)

<_io.TextIOWrapper name='datasets_12dancingprincesses.txt' mode='r' encoding='cp1252'>


In [14]:
newls[:20]

['THE',
 'TWELVE',
 'DANCING',
 'PRINCESSES',
 'THE',
 'TWELVE',
 'DANCING',
 'PRINCESSES',
 'There',
 'was',
 'a',
 'king',
 'who',
 'had',
 'twelve',
 'beautiful',
 'daughters',
 '.',
 'They',
 'slept']

In [16]:
len(newls)

1970

In [17]:
#the NLTK FreqDist gives a count for how often each part of the text occurs
fd_wct = FreqDist(newls)
fd_wct

FreqDist({'the': 138, ',': 107, 'and': 74, '.': 50, 'to': 47, ';': 35, 'he': 34, 'they': 31, '’': 31, 'of': 29, ...})

In [19]:
#shows the top 10 words in the text
fd_wct.most_common(10)

[('the', 138),
 (',', 107),
 ('and', 74),
 ('.', 50),
 ('to', 47),
 (';', 35),
 ('he', 34),
 ('they', 31),
 ('’', 31),
 ('of', 29)]

The most common parts of this text seem to be filler words and punctuation. We need to remove them to get a better understand of what the text is about.

In [20]:
#remove the puntuation tokens from the list
for word in newls:
    if word in punctuation:
        newls.remove(word)

In [21]:
#number of tokens in list after puntuation removal
len(newls)

1763

In [22]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [23]:
#list of english stopwords
eng_stopwords = stopwords.words('english')
eng_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [28]:
rm_count = 0
new_words = []  #list to hold new words

for word in newls:
    if word not in eng_stopwords:
        new_words.append(word)
    else: rm_count += 1

In [29]:
rm_count

927

In [30]:
len(new_words)

836

In [32]:
fd_nw = FreqDist(new_words)
fd_nw.most_common(12)

[('’', 31),
 ('‘', 22),
 ('soldier', 19),
 ('I', 18),
 ('princesses', 17),
 ('said', 16),
 ('king', 15),
 ('went', 11),
 ('twelve', 10),
 ('came', 10),
 ('eldest', 10),
 ('night', 8)]