**INTRO**

- unstructured text data is extremely common; analyzing large quantities of it is a key way to understand what people are thinking and to find trending news topics
- examples: tweets on Twitter; reviews on Amazon, which help users purchase the best-rated products

These examples of organizing and structuring knowledge represent **Natural Language Processing (NLP) tasks**.

NLP: a field of computer science that focuses on the interaction between computers and humans. NLP techniques are used to analyze text, providing a way for computers to understand human language. A few examples of NLP applications:
- automatic summarization
- topic segmentation
- sentiment analysis

This tutorial uses the **Natural Language Toolkit (NLTK)**, an NLP library for Python.

**PREREQS**
- have Python 3 installed
- have a local programming environment set up

**STEP 1: IMPORT NLTK**

In [1]:
# import the corpus reader (the corpus data itself still has to be downloaded; see Step 2)
from nltk.corpus import twitter_samples

**STEP 2: DOWNLOAD NLTK'S DATA AND TAGGER**

We will use a Twitter corpus that we can download through NLTK, specifically NLTK's "twitter_samples" corpus.

In [2]:
# import NLTK itself so we can call its downloader
import nltk

# download twitter_samples from nltk.corpus
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\liuco\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

NLTK's Twitter corpus contains a sample of 20k tweets retrieved from the **Twitter Streaming API**. Full tweets are stored as line-separated **JSON**, meaning one JSON object per line.
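
To make "line-separated JSON" concrete, here is a minimal sketch of how such data can be parsed with Python's standard json module. The two records and their fields are made up for illustration; real tweet objects carry many more fields, and NLTK's corpus reader does this parsing for us.

```python
import json

# Hypothetical stand-in for the contents of a line-separated JSON file:
# one complete JSON object per line (made-up records, not real corpus data).
raw = "\n".join([
    '{"id": 1, "text": "Loving this sunny day :)"}',
    '{"id": 2, "text": "Best coffee ever!"}',
])

# Parse each non-empty line independently.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Keep only the tweet text, similar in spirit to twitter_samples.strings().
texts = [record["text"] for record in records]
print(texts)
```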

In [3]:
# list the JSON files that exist in the corpus
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [4]:
# use file IDs to return tweet strings

#test: twitter_samples.strings('positive_tweets.json')

Now, we know our **corpus** was downloaded successfully. We have access to twitter_samples corpus, so now we can begin writing a script to process tweets.

The goal of our script will be to count how many **nouns and adjectives** appear in the positive subset of the twitter_samples corpus.

Later, we could extend the script to count positive adjectives (great, happy, etc.) vs. negative adjectives (boring, lame, sad, etc.), which could be used to **analyze sentiment** of tweets or of reviews of a product or movie. The resulting counts are data that can in turn inform decisions related to that product or movie.
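
The extension described above could be sketched roughly as follows. The word lists and the tagged sample are hypothetical placeholders chosen for illustration; they are not part of NLTK or the corpus, and a real analysis would need far larger lexicons. The sample uses (token, tag) pairs of the kind produced in Step 4.

```python
# Hypothetical mini-lexicons (assumptions for illustration only).
POSITIVE_ADJECTIVES = {"great", "happy", "good"}
NEGATIVE_ADJECTIVES = {"boring", "lame", "sad"}

# Hand-written (token, tag) pairs in the shape a POS tagger returns.
tagged_sample = [
    [("What", "WP"), ("a", "DT"), ("great", "JJ"), ("movie", "NN")],
    [("The", "DT"), ("ending", "NN"), ("was", "VBD"), ("boring", "JJ")],
    [("So", "RB"), ("happy", "JJ"), ("today", "NN")],
]

positive_count = 0
negative_count = 0

for tweet in tagged_sample:
    for token, tag in tweet:
        if tag != "JJ":
            continue  # only adjectives are of interest here
        word = token.lower()
        if word in POSITIVE_ADJECTIVES:
            positive_count += 1
        elif word in NEGATIVE_ADJECTIVES:
            negative_count += 1

print("positive adjectives:", positive_count)
print("negative adjectives:", negative_count)
```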

**STEP 3: TOKENIZING SENTENCES**
1. create the script that we'll be working in and call it nlp.py
2. import the corpus
3. create a tweets variable
4. assign it the list of tweet strings from the positive_tweets.json file

In [5]:
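# import statements (the corpus reader is needed for twitter_samples below)
import nltk
from nltk.corpus import twitter_samples

# download twitter_samples (skipped if it is already up to date)
nltk.download('twitter_samples')

# set tweets to list of tweet strings from positive_tweets.json file
tweets = twitter_samples.strings('positive_tweets.json')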

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\liuco\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


When we first load the tweets, each tweet is a single string. Before we can determine which words in the tweets are adjectives or nouns, we need to **tokenize** the tweets.

**TOKENIZATION**: the act of breaking up a string (a sequence of characters) into pieces such as words, keywords, phrases, symbols, and other elements, which are called **tokens**.
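
As a rough illustration of what tokenization does, the sketch below splits one made-up tweet with a simple regular expression. This is not the tokenizer NLTK applies to the twitter_samples corpus, which handles emoticons and other edge cases more carefully; it only shows the string-to-tokens idea.

```python
import re

# Made-up example tweet (not taken from the corpus).
tweet = "Loving the new phone, thanks @shop! #happy"

# [@#]\w+ keeps handles and hashtags whole,
# \w+ matches runs of letters, digits, and underscores,
# [^\w\s] matches any single punctuation character.
tokens = re.findall(r"[@#]\w+|\w+|[^\w\s]", tweet)
print(tokens)
```

The one string becomes a list of nine tokens, with punctuation kept as separate tokens.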

In [6]:
# create new variable: tweets_tokens
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')

tweets_tokens is a list in which each element is the list of tokens for one tweet. Now that we have the tokens of each tweet, we can tag them with the appropriate part-of-speech (POS) tags.

**STEP 4: TAGGING SENTENCES**
To access NLTK's POS tagger, need to import it

In [7]:
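# download the tagger model, then import the tagging function
# (newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead)
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag_sents

# tag each tweet's tokens and store the tagged lists in a new variable, tweets_tagged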
tweets_tagged = pos_tag_sents(tweets_tokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\liuco\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Each tagged tweet is represented as a **list**, and for each token we have info about its POS tag. Each token/tag pair is saved as a **tuple**.
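
To make that structure concrete, here is a hand-written example of what one tagged tweet might look like. The tokens and tags are illustrative, not actual tagger output.

```python
# A hand-written tagged tweet: a list of (token, tag) tuples.
tagged_tweet = [("Happy", "JJ"), ("birthday", "NN"), ("friends", "NNS"), ("!", ".")]

# Index 0 of each tuple is the token, index 1 is the POS tag.
first_pair = tagged_tweet[0]
print(first_pair[0], first_pair[1])

# Tuple unpacking gives the same values with clearer names.
for token, tag in tagged_tweet:
    print(f"{token!r} is tagged {tag}")
```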

In NLTK's tagger output (Penn Treebank tags):
- abbreviation for adjectives is **JJ**
- abbreviation for singular nouns is **NN**
- abbreviation for plural nouns is **NNS**

In the next step, we will count how many times **JJ** and **NN** appear throughout the corpus.

**STEP 5: COUNTING POS TAGS**

- note: POS = part of speech

We will keep track of how many times **JJ** and **NN** appear using an **accumulator (count) variable**, which we add to every time we find a matching tag.

In [8]:
# first, create counts that are set to 0 initially
JJ_count = 0
NN_count = 0

After we create the variables, we create 2 for loops. The first loop iterates through each tweet in the list. The second loop iterates through each token/tag pair in each tweet. For each pair, we look up the tag using the appropriate tuple index (index 1).

Next, check whether the tag matches either string **'JJ'** or **'NN'** using **conditional statements**. If the tag is a match, add 1 (+= 1) to the appropriate accumulator.

In [9]:
# 1st for loop: iterate thru each tweet in list
for tweet in tweets_tagged:
    # 2nd for loop: iterates thru each token/tag pair in each tweet
    for pair in tweet:
        # for each pair, look up tag 
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        elif tag == 'NN':
            NN_count += 1

After the 2 loops complete, we should have the total counts for adjectives and nouns in our corpus. Add print statements to see how many adjectives and nouns were found.

In [10]:
print('Total # adjectives = ', JJ_count)
print('Total # nouns = ', NN_count)

Total # adjectives =  6094
Total # nouns =  13180
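
As an alternative to separate accumulators, collections.Counter from the standard library can tally every tag in one pass, which also makes it easy to include plural nouns (NNS) in the noun total. The tagged data below is a small hand-written stand-in for tweets_tagged, so its counts are not the corpus counts.

```python
from collections import Counter

# Hand-written stand-in for tweets_tagged (illustrative tags, not tagger output).
tagged_sample = [
    [("Happy", "JJ"), ("birthday", "NN"), ("friends", "NNS")],
    [("Great", "JJ"), ("day", "NN"), ("for", "IN"), ("a", "DT"), ("long", "JJ"), ("walk", "NN")],
]

# Count every tag across all tweets in a single pass.
tag_counts = Counter(tag for tweet in tagged_sample for _, tag in tweet)

adjectives = tag_counts["JJ"]
nouns = tag_counts["NN"] + tag_counts["NNS"]  # singular plus plural nouns

print("Total # adjectives = ", adjectives)
print("Total # nouns = ", nouns)
```

A Counter returns 0 for tags it never saw, so looking up a missing tag does not raise an error.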


**USER-FRIENDLY VIEW OF ALL CODE WRITTEN**

In [11]:
# Import data and tagger
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag_sents

# Load tokenized tweets
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')

# Tag tokenized tweets
tweets_tagged = pos_tag_sents(tweets_tokens)

# Set accumulators
JJ_count = 0
NN_count = 0

# Loop through list of tweets
for tweet in tweets_tagged:
    for pair in tweet:
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        elif tag == 'NN':
            NN_count += 1

# Print total numbers of adjectives and nouns
print('Total number of adjectives = ', JJ_count)
print('Total number of nouns = ', NN_count)

Total number of adjectives =  6094
Total number of nouns =  13180


**CONCLUSION**

- learned some NLP techniques to analyze text using the NLTK library in Python
- able to download corpora, tokenize, tag, and count POS tags in Python
- maybe try now with real Twitter data or with reading in own data (plain text files?)