# MSDS Network Analysis, Lab 2: Build a Semantic Network

## ⚡️ Make a Copy

Save a copy of this notebook in your Google Drive before continuing. Be sure to edit your own copy, not the original notebook.

## 📓 About this lab

In this lab, you will build a semantic network of Tweets, that is, a graph of Tweets related by natural language features of the Tweet texts.


# Imports

In [1]:
import gzip
import re
import itertools
import json
import networkx as nx
import matplotlib.pyplot as plt  # needed for the plot at the end of the lab
import nltk
import string

In [2]:
print(nltk.__version__)

3.4.5


In [3]:
import sys

print("Python version")
print(sys.version)

Python version
3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:38:07) [MSC v.1929 64 bit (AMD64)]


## Get the data

Be sure you still have the brand Tweets file on your Google Drive from the previous Lab.

In [4]:
DATA_FILE = "nikelululemonadidas_tweets.jsonl.gz"

## Mount Google Drive

In [5]:
# from google.colab import drive
# drive.mount('/content/drive')

### About the NLTK and NLP in Python

The [NLTK (Natural Language Toolkit)](https://www.nltk.org/) is an old standby for natural language processing (NLP) in the Python world.

There are a good number of NLP-related Python packages, but many of them are in fact built on the NLTK, so it is worth getting some foundational exposure to that package.

If you want to explore NLP with Python in more depth, popular libraries include [spaCy](https://spacy.io/) and [TextBlob](https://textblob.readthedocs.io/en/dev/), both of which are (like the NLTK) broadly scoped generalist NLP libraries.

There are also many specialized libraries for tasks such as keyword extraction, fuzzy string matching, and natural language data handling. If you have a task at hand, it's worth doing a quick search to see if the problem has already been solved.

### NLTK downloads

A lot of NLTK tools fall into the category of corpus linguistics, and the algorithms for these tools often require large amounts of backing data that can unnecessarily bloat the package size if that tool is not being used.

To help manage bloat, the NLTK distributes its supplemental data via a download mechanism that is used on an as-needed basis.

For this assignment, you will be using the `punkt`, `stopwords` and `wordnet` datasets, which are downloaded here.

In [6]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dennis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dennis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Dennis\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Text processing functions

Here, we define a few functions that will be used to clean up the tweet texts. Take the time to understand what these functions are doing and how they work. You may, however, want to skim ahead to get a sense of how these functions are being used in order to better motivate your understanding.

### Text tokenization




The near-universal preliminary step to natural language processing of free-form text content is the task of tokenization.

Tokenization is the task of breaking a text down into its component parts. At the sentence level in English, we think of these parts as words, although tokens also include things like punctuation, and special-case tokens sometimes creep in, as we will see is the case for Tweets.

#### A super-simple tokenizer

Consider the following sentence (modified from the popular typing exercise in order to demonstrate some specifics):

In [9]:
sentence = "The quickly browned jumping fox and the quick brown foxes jumped quickly over the lazy dogs lazily lying."

The simplest tokenizer you could build for this is probably the split function:

In [10]:
sentence.split()

['The',
 'quickly',
 'browned',
 'jumping',
 'fox',
 'and',
 'the',
 'quick',
 'brown',
 'foxes',
 'jumped',
 'quickly',
 'over',
 'the',
 'lazy',
 'dogs',
 'lazily',
 'lying.']

... which works quite well in the simplest cases. But real-world text tends not to fit into simple boxes. Thus, we tend to reach for pre-built tokenizers.

The NLTK word tokenizer is a good example:

In [11]:
nltk.tokenize.word_tokenize(sentence)

['The',
 'quickly',
 'browned',
 'jumping',
 'fox',
 'and',
 'the',
 'quick',
 'brown',
 'foxes',
 'jumped',
 'quickly',
 'over',
 'the',
 'lazy',
 'dogs',
 'lazily',
 'lying',
 '.']

A notable difference here is that punctuation is correctly tokenized, unlike with our simple `split` tokenizer.
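
To make the difference concrete, here is a minimal sketch of a regular-expression tokenizer that separates punctuation from words. The function name `simple_regex_tokenize` and the sample sentence are made up for illustration; this is not how the NLTK tokenizer is implemented, only a rough approximation of the punctuation handling.

```python
import re

def simple_regex_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The lazy dogs, lazily lying."
print(sample.split())                  # punctuation stays glued to the words
print(simple_regex_tokenize(sample))   # punctuation becomes its own token
```

A pattern this simple still mishandles contractions, abbreviations, and similar cases, which is one reason to prefer a pre-built tokenizer.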

But consider this example Tweet:

In [12]:
example_tweet = "hope I get a new pair of these @Nike shoes!!!! #nikelife https://www.nike.com/launch/t/womens-air-force-1-reveal-pastel-reveal"
nltk.tokenize.word_tokenize(example_tweet)

['hope',
 'I',
 'get',
 'a',
 'new',
 'pair',
 'of',
 'these',
 '@',
 'Nike',
 'shoes',
 '!',
 '!',
 '!',
 '!',
 '#',
 'nikelife',
 'https',
 ':',
 '//www.nike.com/launch/t/womens-air-force-1-reveal-pastel-reveal']

There are a few problems here, particularly the poor/improper handling of:

 * @mentions
 * #hashtags
 * web URLs

For this reason, NLTK provides a specialized Tweet tokenizer:

In [13]:
nltk.TweetTokenizer().tokenize(example_tweet)

['hope',
 'I',
 'get',
 'a',
 'new',
 'pair',
 'of',
 'these',
 '@Nike',
 'shoes',
 '!',
 '!',
 '!',
 '#nikelife',
 'https://www.nike.com/launch/t/womens-air-force-1-reveal-pastel-reveal']

That's better. You will further work with the Tweet Tokenizer in Homework 1. Here, let's build a simple tokenize function that can use either the word tokenizer or the Tweet tokenizer.

While we are at it, we'll normalize the text to lowercase so that we can think of, e.g. "The" as being the same word as "the".

In [14]:
TWEET_TOKENIZER = nltk.TweetTokenizer().tokenize
WORD_TOKENIZER = nltk.tokenize.word_tokenize

def tokenize(text, lowercase=True, tweet=False):
    """Tokenize the text. By default, also normalizes text to lowercase.
    Optionally uses the Tweet Tokenizer.
    """
    if lowercase:
        text = text.lower()
    if tweet:
        return TWEET_TOKENIZER(text)
    else:
        return WORD_TOKENIZER(text)
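
To make the Tweet-specific handling less of a black box, here is a rough regular-expression sketch that keeps @mentions, #hashtags, and URLs intact. The pattern and the name `rough_tweet_tokenize` are assumptions made for illustration; the NLTK Tweet tokenizer is considerably more thorough (emoticons, phone numbers, and so on).

```python
import re

# Hypothetical pattern; alternatives are tried left to right:
# URLs, then @mentions and #hashtags, then words, then any single symbol.
TWEET_PATTERN = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]")

def rough_tweet_tokenize(text):
    """Tokenize text while keeping mentions, hashtags, and links whole."""
    return TWEET_PATTERN.findall(text)

rough_tokens = rough_tweet_tokenize(
    "love my @Nike shoes! #nikelife https://www.nike.com/launch"
)
print(rough_tokens)
```

The order of the alternatives matters: if the plain-word alternative came first, `https` would be split off from the rest of the URL.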

### Functions that take tokens

After tokenizing, there are often a number of other preprocessing steps involved in preparing text data for analysis. We often consider mechanisms for text normalization.

We already lowercased the text as one kind of normalization. Two other commonly considered techniques are stemming and lemmatization. Both are approaches to dealing with variations in word forms, such as pluralization and conjugation.

Take a look at the following to get a feel for the differences. In this Lab, we will use the lemmatizer, which produces more natural normalized word forms.

> ⚠️ **Caveat:** We will be using the lemmatizer in a bit of a naive way here. The WordNet lemmatizer defaults to treating words as nouns unless told otherwise. The result is that we are mainly just handling the differences in pluralization with the way we are lemmatizing. For more sophisticated lemmas, you would need to do part-of-speech tagging. You explored POS tagging in homework 2. Some extra work would be required to get the POS tags from that assignment into the form required by this lemmatizer. For purposes of this lab, we will stick with the default noun assumption.

> As an example of the effects of using POS tagging, see the code snippets below. [This article at machinelearningplus.com](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/) provides a good overview of some different approaches to lemmatizing, including applying parts of speech to WordNet.

### Lemmatizing with POS

The following code snippets demonstrate differences in signaling the part-of-speech to the lemmatizer. The WordNet lemmatizer defaults to treating everything as nouns, which we will simply accept as good enough for the purpose of this lab.

In [15]:
lemmatizer = nltk.WordNetLemmatizer()
print("noun:", lemmatizer.lemmatize("jumping", "n"))
print("verb:", lemmatizer.lemmatize("jumping", "v"))

noun: jumping
verb: jump


In [16]:
print("noun:", lemmatizer.lemmatize("lying", "n"))
print("verb:", lemmatizer.lemmatize("lying", "v"))

noun: lying
verb: lie


In [17]:
STEMMER = nltk.PorterStemmer()

def stem(tokens):
    """Stem the tokens. I.e., remove morphological affixes and
    normalize to standardized stem forms.

    Has the side effect of producing "unnatural" forms due to
    stemming standards. E.g. quickly becomes quickli
    """
    return [ STEMMER.stem(token) for token in tokens ]

In [18]:
print(stem(tokenize(sentence)))

['the', 'quickli', 'brown', 'jump', 'fox', 'and', 'the', 'quick', 'brown', 'fox', 'jump', 'quickli', 'over', 'the', 'lazi', 'dog', 'lazili', 'lie', '.']


In [19]:
LEMMATIZER = nltk.WordNetLemmatizer()

def lemmatize(tokens):
    """Lemmatize the tokens.

    Retains more natural word forms than stemming, but assumes all
    tokens are nouns unless tokens are passed as (word, pos) tuples.
    """
    lemmas = []
    for token in tokens:
        if isinstance(token, str):
            lemmas.append(LEMMATIZER.lemmatize(token)) # treats token like a noun
        else: # assume a tuple of (word, pos)
            lemmas.append(LEMMATIZER.lemmatize(*token))
    return lemmas

In [20]:
lemmatize([ "foxes", "jumping"])

['fox', 'jumping']

In [21]:
lemmatize([ ("fox", "n"), ("jumps", "v") ])

['fox', 'jump']
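
The `(word, pos)` tuples above use WordNet's single-letter part-of-speech codes (`n`, `v`, `a`, `r`). Tags from a Penn Treebank style tagger would need to be mapped to those codes first. The following is a minimal sketch of one common mapping; the helper name `penn_to_wordnet` and the hand-written `tagged` list are assumptions for illustration, not output from a real tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):   # adjectives: JJ, JJR, JJS
        return "a"
    if tag.startswith("V"):   # verbs: VB, VBD, VBG, ...
        return "v"
    if tag.startswith("R"):   # adverbs: RB, RBR, RBS
        return "r"
    return "n"                # everything else is treated as a noun

# Hand-written (word, tag) pairs standing in for tagger output
tagged = [("foxes", "NNS"), ("jumped", "VBD"), ("quickly", "RB"), ("lazy", "JJ")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# [('foxes', 'n'), ('jumped', 'v'), ('quickly', 'r'), ('lazy', 'a')]
```

A list like `pairs` has the shape that `lemmatize` accepts for its tuple branch.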

In [22]:
print(lemmatize(tokenize(sentence)))

['the', 'quickly', 'browned', 'jumping', 'fox', 'and', 'the', 'quick', 'brown', 'fox', 'jumped', 'quickly', 'over', 'the', 'lazy', 'dog', 'lazily', 'lying', '.']


### Removing stopwords

It can be useful to remove so-called stopwords to improve the average salience of the terms we are analyzing.

Stopwords tend to be things like articles and conjunctions that usually don't offer a lot of value in an analysis.

The NLTK has a corpus of stopwords, but we'll include the option of passing in a custom list if desired.

In [23]:
def remove_stopwords(tokens, stopwords=None):
    """Remove stopwords, i.e. words that we don't want as part of our
    analysis. Defaults to the default set of nltk english stopwords.
    """
    if stopwords is None:
        stopwords = nltk.corpus.stopwords.words("english")
    return [ token for token in tokens if token not in stopwords ]

In [24]:
tokens = tokenize(sentence)
print(tokens)
print(remove_stopwords(tokens))

['the', 'quickly', 'browned', 'jumping', 'fox', 'and', 'the', 'quick', 'brown', 'foxes', 'jumped', 'quickly', 'over', 'the', 'lazy', 'dogs', 'lazily', 'lying', '.']
['quickly', 'browned', 'jumping', 'fox', 'quick', 'brown', 'foxes', 'jumped', 'quickly', 'lazy', 'dogs', 'lazily', 'lying', '.']
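
One possible variation, sketched here under stated assumptions: `remove_stopwords` reloads the NLTK list on every call and checks membership against a list, which can get slow over a large dataset. Building the stop list once as a set makes each lookup cheap. `TOY_STOPWORDS` and `remove_stopwords_with_set` are hypothetical names, and the tiny stop list is a stand-in for the real NLTK corpus.

```python
# Tiny stand-in stop list for illustration only; in the lab this would be
# built once from nltk.corpus.stopwords.words("english").
TOY_STOPWORDS = frozenset({"the", "and", "over"})

def remove_stopwords_with_set(tokens, stopwords=TOY_STOPWORDS):
    """Remove stopwords using set membership instead of a list scan."""
    return [token for token in tokens if token not in stopwords]

toy_tokens = ["the", "quick", "brown", "fox", "and", "the", "lazy", "dog"]
filtered = remove_stopwords_with_set(toy_tokens)
print(filtered)
# ['quick', 'brown', 'fox', 'lazy', 'dog']
```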


### Removing hyperlinks

Unless your analysis involves looking at what users are linking to (a more difficult and involved task than it might seem), you might want to simply get those links out of the way.

In [25]:
def remove_links(tokens):
    """Removes http/s links from the tokens.

    This simple implementation assumes links have been kept intact as whole
    tokens. E.g. the way the Tweet Tokenizer works.
    """
    return [ t for t in tokens
            if not t.startswith("http://")
            and not t.startswith("https://")
        ]


In [26]:
print(remove_links(tokenize(example_tweet, tweet=True)))

['hope', 'i', 'get', 'a', 'new', 'pair', 'of', 'these', '@nike', 'shoes', '!', '!', '!', '#nikelife']
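
If the tokenizer in use does not keep links intact (as with the plain word tokenizer), an alternative is to strip links from the raw text before tokenizing. This is a minimal sketch, assuming links always start with `http://` or `https://`; the name `strip_links_from_text` is made up for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def strip_links_from_text(text):
    """Remove http/s links from raw text and tidy up leftover whitespace."""
    without_links = URL_PATTERN.sub(" ", text)
    return " ".join(without_links.split())

cleaned_text = strip_links_from_text("new shoes https://www.nike.com/launch today")
print(cleaned_text)
# new shoes today
```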


### Removing punctuation

Finally, for our purposes of analysis, we are really only interested in words, not punctuation. Here, we simply remove tokens that are punctuation.

Tweets can get pretty messy, so we've gone beyond simply removing punctuation tokens and decided to clean out punctuation altogether.

In [27]:
def remove_punctuation(tokens,
                       strip_mentions=False,
                       strip_hashtags=False,
                       strict=False):
    """Remove punctuation from a list of tokens.

    Has some specialized options for dealing with Tweets:

    strip_mentions=True will strip the @ off of @ mentions
    strip_hashtags=True will strip the # from hashtags

    strict=True will remove all punctuation from all tokens, not just
    tokens that are punctuation per se.
    """
    tokens = [t for t in tokens if t not in string.punctuation]
    if strip_mentions:
        tokens = [t.lstrip('@') for t in tokens]
    if strip_hashtags:
        tokens = [t.lstrip('#') for t in tokens]
    if strict:
        cleaned = []
        for t in tokens:
            cleaned.append(
                t.translate(str.maketrans('', '', string.punctuation)).strip())
        tokens = [t for t in cleaned if t]
    return tokens

In [28]:
tokens = tokenize(example_tweet, tweet=True)
print(tokens)
print(remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True))

['hope', 'i', 'get', 'a', 'new', 'pair', 'of', 'these', '@nike', 'shoes', '!', '!', '!', '#nikelife', 'https://www.nike.com/launch/t/womens-air-force-1-reveal-pastel-reveal']
['hope', 'i', 'get', 'a', 'new', 'pair', 'of', 'these', 'nike', 'shoes', 'nikelife', 'https://www.nike.com/launch/t/womens-air-force-1-reveal-pastel-reveal']


In [29]:
simple_tokens = example_tweet.split()
print(simple_tokens)
print(remove_punctuation(simple_tokens, strict=True))

['hope', 'I', 'get', 'a', 'new', 'pair', 'of', 'these', '@Nike', 'shoes!!!!', '#nikelife', 'https://www.nike.com/launch/t/womens-air-force-1-reveal-pastel-reveal']
['hope', 'I', 'get', 'a', 'new', 'pair', 'of', 'these', 'Nike', 'shoes', 'nikelife', 'httpswwwnikecomlaunchtwomensairforce1revealpastelreveal']


## Finally working with the data

Data cleanup is a big task and ultimately one of the bigger burdens of any analysis project. But, now that we have a good suite of utilities for handling our Tweets, the remainder of our work goes quickly.

The code below will do the following for each Tweet in the dataset:

 * Tokenize the text using the Tweet Tokenizer
 * Remove hyperlinks
 * Remove stopwords (standard English stopwords)
 * Remove punctuation tokens and strip @ and # from mentions and hashtags (see note below)
 * Lemmatize the remaining word tokens (using default noun part-of-speech for simplicity)

... and will collect the unique words and their counts into `word_counts`.

> 💡 Since this is a semantic network we are building, it seems useful to, e.g., treat **@Nike** and **Nike** as the same word. Hence, `strip_mentions`, and `strip_hashtags`. In some cases, for example a mentions network, you would probably take a different approach. As you preprocess and prepare data for the task at hand, it is important to be intentional and aware of how you are handling the text with your end goals in mind.

In [30]:
word_counts = {}

with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print(f"Processed {i} tweets")
        tweet = json.loads(line)
        text = tweet["full_text"]
        tokens = tokenize(text, tweet=True)
        tokens = remove_links(tokens)
        tokens = remove_stopwords(tokens)
        tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
        tokens = lemmatize(tokens)
        for word in tokens:
            if word not in word_counts:
                word_counts[word] = 0
            word_counts[word] += 1

Processed 0 tweets
Processed 10000 tweets
Processed 20000 tweets
Processed 30000 tweets
Processed 40000 tweets
Processed 50000 tweets
Processed 60000 tweets
Processed 70000 tweets
Processed 80000 tweets
Processed 90000 tweets
Processed 100000 tweets
Processed 110000 tweets
Processed 120000 tweets
Processed 130000 tweets
Processed 140000 tweets
Processed 150000 tweets
Processed 160000 tweets
Processed 170000 tweets


In [31]:
len(word_counts)

86527
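
The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below uses a few hand-made token lists (`toy_token_lists`, a hypothetical stand-in for the preprocessed Tweets) rather than the real dataset.

```python
from collections import Counter

# Hand-made token lists standing in for preprocessed Tweets
toy_token_lists = [
    ["nike", "shoe", "run"],
    ["nike", "adidas"],
    ["run", "run"],
]

toy_counts = Counter()
for toy_tokens in toy_token_lists:
    toy_counts.update(toy_tokens)   # adds 1 per occurrence of each token

print(len(toy_counts))      # number of unique words: 4
print(toy_counts["run"])    # occurrences of "run": 3
```

A `Counter` returns 0 for a missing key, so the explicit "if word not in word_counts" check is not needed.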

---

## 🧐 Lab Quiz Question #1

Precisely how many unique words are in the dataset after removing links, stopwords, and punctuation and then lemmatizing the remaining tokens? Use the length of `word_counts` to determine your answer.

Be sure to answer this and the remaining lab quiz questions in Lab Quiz 2.

Answer = 86527

---

### NOTE: Only for Question 1, not running the rest as obtained earlier

## Reducing the graph to the most common words

To keep the size of your semantic network manageable, reduce the word set to just the top 1000 most popular words.

To do this, you will sort the word counts by reverse value (i.e. by count from highest to lowest) and take a slice of 1000.

In [None]:
sorted_counts = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
sorted_words = [word for word, count in sorted_counts]
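
The sort-then-slice idea can be sketched on a toy dictionary. The counts in `toy_word_counts` are invented for illustration; `Counter.most_common` is shown as an equivalent standard-library shortcut.

```python
from collections import Counter

# Invented counts for illustration only
toy_word_counts = {"shoe": 1, "nike": 5, "adidas": 3, "rt": 4}

# Sort by count, highest first, then keep the top n words
n = 2
toy_sorted = sorted(toy_word_counts.items(), key=lambda item: item[1], reverse=True)
toy_top_words = [word for word, count in toy_sorted][:n]
print(toy_top_words)
# ['nike', 'rt']

# Counter.most_common(n) gives the same top-n (word, count) pairs
print(Counter(toy_word_counts).most_common(n))
# [('nike', 5), ('rt', 4)]
```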

Let's take a look at just a few of the top words:

In [None]:
sorted_words[:10]

Some things to note:

 * Some punctuation appears to have made it through. We leave it as a thought exercise to consider why these tokens are here and how you might clean them up.

 * rt is right up there near the top, which is not surprising given that these are Tweets. This is an example of something you might clean up with, for example, a specialized stopword list. This cleanup is included below as a coding exercise.

 * While Nike and Adidas made it to the top 10, Lululemon is not here. Why might that be? (The code snippet below sheds some light.) And how would you deal with this if you wanted to include Lululemon in your analysis? (Hint: think about the segmentation work you did in the Topic Modeling course.)

In [None]:
print("Nike:", word_counts["nike"])
print("Adidas:", word_counts["adidas"])
print("Lululemon:", word_counts["lululemon"])

---

## 🧐 Lab Quiz Question #2

What is the most common word in the cleaned dataset? Use the sliced inspection of sorted_words above to answer the question.

Answer = Nike

---

## 🛠 Exercise

As mentioned above, there are a lot of "rt" (retweet) instances in the word set. As an exercise in developing a specialized stopword list, complete the code below to remove "rt" during pre-processing.

The code snippet is identical to what we already did above, but this time you need to pass in a custom stop list. The custom stop list needs to include all the words that are already being stopped, plus "rt" as a stopword.

> ⚠️ Important: Due to the way `remove_stopwords` has been implemented, it is not sufficient to simply pass in ["rt"] as your stoplist. You'll want to be sure to include all the standard stopwords too!

In [None]:
# Ensure you have the stopwords downloaded
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

# Create the custom stopwords list
standard_stopwords = set(stopwords.words('english'))
custom_stopwords = standard_stopwords.union({"rt"})

word_counts = {}



with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print(f"Processed {i} tweets")
        tweet = json.loads(line)
        text = tweet["full_text"]
        tokens = tokenize(text, tweet=True)
        tokens = remove_links(tokens)
        tokens = remove_stopwords(tokens, stopwords=custom_stopwords)
        tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
        tokens = lemmatize(tokens)
        for word in tokens:
            if word not in word_counts:
                word_counts[word] = 0
            word_counts[word] += 1

In [None]:
sorted_counts = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
sorted_words = [word for word, count in sorted_counts]

In [None]:
sorted_words[:10]

In [None]:
sorted_counts[:10]

---

## 🧐 Lab Quiz Question #3

After adding "rt" to the stopword list, now what is the most common word in the cleaned dataset? Use the sliced inspection of sorted_words above to answer the question.

💡 Hint: This is not meant to be a trick question so much as to just be sure you are following along with understanding. If you think about it, you could probably have answered this question before even implementing the code changes.

---

## Build and plot the graph

You have now done all the heavy lifting required to build the semantic network.

The code below builds an undirected semantic network of co-occurring words that belong to our network of top n terms. These graphs can get kind of heavy, so start with a small graph of n=20 to keep things manageable.

To do this, we need to:

 * Process each tweet in the same way we did previously
 * Determine which tokens in the Tweet belong to the top N
 * Add all of the 2-combinations (i.e. co-occurrences) of included terms as edges in the graph.

We use the handy [itertools module](https://docs.python.org/3/library/itertools.html) to help us get this last thing done.

In [None]:
N = 20
top_terms = sorted_words[:N]
graph = nx.Graph()

with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print(f"Processed {i} tweets")
        tweet = json.loads(line)
        text = tweet["full_text"]
        tokens = tokenize(text, tweet=True)
        tokens = remove_links(tokens)
        tokens = remove_stopwords(tokens, stopwords=custom_stopwords)
        tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
        tokens = lemmatize(tokens)

        # reduce the tweet to terms in the top-N term network and add the
        # term relationships to the graph
        nodes = [t for t in tokens if t in top_terms]
        cooccurrences = itertools.combinations(nodes, 2)
        if i == 0:
            print("Just a glimpse so you can see what the cooccurrences for a tweet look like:")
            cooccurrences = list(cooccurrences)
            print(cooccurrences)
        graph.add_edges_from(cooccurrences)
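
Two details of the co-occurrence step are worth seeing on toy data. First, `itertools.combinations` works on positions, so a term repeated within one Tweet yields a pair of that term with itself (a self-loop once added to the graph). Second, `add_edges_from` only records that an edge exists, not how often a pair co-occurs. The sketch below, using hand-made token lists (`toy_tweets`, hypothetical), deduplicates terms per Tweet and counts pairs with a `Counter`; it is one possible variation, not part of the lab's required code.

```python
import itertools
from collections import Counter

# A repeated term produces a pair of the term with itself
print(list(itertools.combinations(["nike", "nike", "shoe"], 2)))
# [('nike', 'nike'), ('nike', 'shoe'), ('nike', 'shoe')]

# Hand-made token lists standing in for preprocessed Tweets
toy_tweets = [
    ["nike", "shoe", "run"],
    ["nike", "run"],
    ["nike", "nike", "shoe"],
]

pair_counts = Counter()
for toy_tokens in toy_tweets:
    # Deduplicate and sort so each pair is counted once per Tweet
    # and always appears in the same order
    unique_terms = sorted(set(toy_tokens))
    pair_counts.update(itertools.combinations(unique_terms, 2))

print(pair_counts[("nike", "run")])    # 2
print(pair_counts[("run", "shoe")])    # 1
```

Counts like these could then be attached to the graph as edge weights, for example with `graph.add_edge(a, b, weight=count)`.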

In [None]:
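# nx.info() was removed in NetworkX 3.0; report the graph size directly
print(f"{graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")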

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(300, 300))
nx.draw_networkx(graph, ax=ax, font_color="#FFFFFF", font_size=20, node_size=30000, width=4, arrowsize=100)

## Create a local env

### Open Anaconda Prompt or terminal

conda create --name myenv python=3.8

conda activate myenv

pip install nltk==3.4.5

python -c "import nltk; print(nltk.__version__)"
