This notebook implements the method from Turney and Littman (2003) to determine the semantic orientation (SO) of words. The core idea is that a word's sentiment (positive or negative) can be calculated by measuring its similarity to a set of pre-defined "positive" and "negative" seed words. We will use pre-trained GloVe word embeddings to measure this similarity.

#### Cell 1: Importing Necessary Libraries
This cell imports the libraries required for the script. These include modules for handling collections (`defaultdict`), sorting (`operator`), and interacting with word embeddings using the `gensim` library.

In [None]:
# Import `re` (not used in the final code) and `operator` (used later to sort scores by value).
import re, operator
# Import defaultdict, a dictionary subclass that calls a factory function to supply missing values.
from collections import defaultdict
# Import Word2Vec and KeyedVectors for loading and working with word embeddings.
from gensim.models import Word2Vec, KeyedVectors
# Import a utility to convert GloVe format embeddings to the Word2Vec format.
from gensim.scripts.glove2word2vec import glove2word2vec
# Import a utility for finding the path to test data, though not directly used here.
from gensim.test.utils import datapath

#### Cell 2: Defining Seed Word Dictionaries
Here, we define the initial sets of positive and negative words as specified by Turney and Littman. The semantic orientation of any given word will be calculated based on its aggregate similarity to these two sets.

In [None]:
# Create a set of positive seed words. Using a set provides fast membership testing.
turney_littman_positive=set(["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"])
# Create a set of negative seed words.
turney_littman_negative=set(["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"])

#### Cell 3: Instructions for Data Download
This markdown cell provides instructions for the user to download the pre-trained GloVe word vectors required for the analysis. These vectors represent words as dense numerical vectors, capturing semantic relationships between them.

Download the GloVe vectors [glove.42B.300d.50K.txt](https://drive.google.com/file/d/1n1jt0UIdI3CD26cY1EIeks39XH5S8O8M/view?usp=sharing) and [glove.twitter.27B.100d.50K.txt](https://drive.google.com/file/d/1Tk4S5u6mwwZwEd5H7bimNXzHnbqWw7_y/view?usp=sharing) and store them in the data/ directory. These are word vectors from [GloVe](https://nlp.stanford.edu/projects/glove/), restricted to the 50K most frequent words from each source.

#### Cell 4: GloVe to Word2Vec Conversion Function
The `gensim` library's `KeyedVectors.load_word2vec_format` function expects embeddings to be in the Word2Vec format. Since the original files are in the GloVe format, this helper function, `read_glove`, first converts the GloVe file into a temporary Word2Vec-formatted file and then loads it.

In [None]:
# Define a function to read a GloVe embedding file.
def read_glove(filename):

    # Define the output filename for the Word2Vec format.
    glove_in_w2v_format="%s.w2v" % filename
    # Convert the GloVe file to Word2Vec format and save it. The underscore `_` is used to discard the return value.
    _ = glove2word2vec(filename, glove_in_w2v_format)
    
    # Load the newly created Word2Vec format file into a KeyedVectors object.
    glove = KeyedVectors.load_word2vec_format(glove_in_w2v_format, binary=False)
    # Return the loaded embeddings object.
    return glove
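
The two text formats differ only in a header line: a Word2Vec text file starts with the vocabulary size and the vector dimensionality, while a GloVe file starts directly with the first word. The conversion can be sketched without `gensim` as follows; `add_word2vec_header` and the toy data are hypothetical names introduced for illustration, not part of the library.

```python
def add_word2vec_header(glove_lines):
    """Return Word2Vec-style text lines for the given GloVe lines.

    Hypothetical helper for illustration: it prepends the
    "<vocab_size> <dimensions>" header that GloVe files lack.
    """
    # Drop empty lines and trailing newlines.
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each row is a word followed by its space-separated vector components.
    dimensions = len(rows[0].split(" ")) - 1
    return ["%d %d" % (len(rows), dimensions)] + rows


# Toy two-word, three-dimensional "GloVe" file (made-up values).
toy_glove = ["good 0.1 0.2 0.3\n", "bad -0.1 -0.2 -0.3\n"]
converted = add_word2vec_header(toy_glove)
print(converted[0])
```

If the installed `gensim` is version 4.0 or later, `KeyedVectors.load_word2vec_format(filename, binary=False, no_header=True)` should read a headerless GloVe file directly, which would avoid writing the intermediate `.w2v` file.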

#### Cell 5: Loading Common Crawl Vectors
This cell calls the `read_glove` function to load the GloVe word embeddings that were pre-trained on the Common Crawl dataset. These vectors have 300 dimensions.

In [None]:
# Load the Common Crawl GloVe vectors using the previously defined function.
common_crawl_vectors=read_glove("../data/glove.42B.300d.50K.txt")

#### Cell 6: Loading Twitter Vectors
Similarly, this cell loads the GloVe word embeddings pre-trained on a large corpus of Twitter data. These vectors have 100 dimensions and may capture different semantic nuances compared to the Common Crawl vectors.

In [None]:
# Load the Twitter GloVe vectors.
twitter_vectors=read_glove("../data/glove.twitter.27B.100d.50K.txt")

#### Cell 7: Question 1
This markdown cell poses the first task: to implement the Turney and Littman (2003) algorithm for calculating a word's semantic orientation.

Q1: Implement the [Turney and Littman (2003)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.6425&rep=rep1&type=pdf) method for calculating the semantic orientation of a term using its neighbors (you can find the method on p. 6).  Use word embeddings to calculate $A$(word1, word2) -- the measure of association between word 1 and word 2.  The arguments to `semantic_orientation` function are a set of word embeddings, the query term, and a positive and negative dictionary; it should return the semantic orientation for the query term (a real value).

#### Cell 8: Implementing the Semantic Orientation Formula
This function implements the core logic of the Turney and Littman paper. The Semantic Orientation (SO) of a `word` is calculated as the sum of its similarities to all words in the `positive_dictionary` minus the sum of its similarities to all words in the `negative_dictionary`.

$SO(word) = \sum_{p_i \in P} A(word, p_i) - \sum_{n_j \in N} A(word, n_j)$

Here, $A(word1, word2)$ is the association (similarity) between two words, which we measure using the cosine similarity from the word embeddings.

In [None]:
# Define the function to calculate the semantic orientation of a single word.
def semantic_orientation(embeddings, word, positive_dictionary, negative_dictionary):
    # Initialize the Semantic Orientation (SO) score to zero.
    SO=0.
    # Iterate through each word in the positive dictionary.
    for query in positive_dictionary:
        # Calculate the cosine similarity between the input word and the positive query word.
        similarity=embeddings.similarity(word, query)
        # Add this similarity score to the total SO score.
        SO+=similarity
    # Iterate through each word in the negative dictionary.
    for query in negative_dictionary:
        # Calculate the cosine similarity between the input word and the negative query word.
        similarity=embeddings.similarity(word, query)
        # SUBTRACT this similarity score from the total SO score.
        SO-=similarity
    # Return the final calculated semantic orientation score.
    return SO
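
To see the formula at work without loading any embeddings, here is a minimal sketch on made-up two-dimensional vectors. The vectors, `cosine_similarity`, and `toy_semantic_orientation` are assumptions for illustration only; the notebook itself relies on `embeddings.similarity` for the association $A$.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: the first coordinate loosely encodes polarity.
toy_vectors = {
    "good": (1.0, 0.1),
    "nice": (0.9, 0.2),
    "bad": (-1.0, 0.1),
    "nasty": (-0.9, 0.3),
    "great": (0.8, 0.3),
    "awful": (-0.8, 0.2),
}


def toy_semantic_orientation(word, positive, negative):
    """Sum of similarities to positive seeds minus sum of similarities to negative seeds."""
    positive_sum = sum(cosine_similarity(toy_vectors[word], toy_vectors[p]) for p in positive)
    negative_sum = sum(cosine_similarity(toy_vectors[word], toy_vectors[n]) for n in negative)
    return positive_sum - negative_sum


so_great = toy_semantic_orientation("great", ["good", "nice"], ["bad", "nasty"])
so_awful = toy_semantic_orientation("awful", ["good", "nice"], ["bad", "nasty"])
print("great: %.3f" % so_great)
print("awful: %.3f" % so_awful)
```

A word that points in roughly the same direction as the positive seeds gets a positive score, and one aligned with the negative seeds gets a negative score.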

#### Cell 9: Function to Calculate and Display SO for an Entire Vocabulary
This helper function iterates through every word in the provided `glove` vocabulary, calculates its semantic orientation using the `semantic_orientation` function, and stores the result. It then sorts all words by their SO score and prints the 25 most positive and 25 most negative words.

In [None]:
# Define a function to find and display the SO for all words in an embeddings vocabulary.
def find_all_semantic_orientation(glove, positive, negative):
    # Create a defaultdict to store the SO scores for each word.
    scores=defaultdict(float)
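    # Iterate through every word in the vocabulary. gensim 4.x exposes it as `key_to_index`;
    # older releases used `vocab`, which gensim 4.x no longer provides.
    for word in (glove.key_to_index if hasattr(glove, "key_to_index") else glove.vocab):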
        # Calculate and store the semantic orientation for the current word.
        scores[word]=semantic_orientation(glove, word, positive, negative)
    
    # Sort the dictionary of scores by value (the SO score) in ascending order.
    sorted_x = sorted(scores.items(), key=operator.itemgetter(1))
        
    # Print the top 25 most positive words (the last 25 items of the sorted list, reversed for descending order).
    for k,v in reversed(sorted_x[-25:]):
        print("%.3f\t%s" % (v,k))
    # Print a blank line for separation.
    print()

    # Print the top 25 most negative words (the first 25 items of the sorted list).
    for k,v in sorted_x[:25]:
        print("%.3f\t%s" % (v,k))
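
Sorting the whole vocabulary works well for 50K words. When only the extremes are needed, the standard library's `heapq.nlargest` and `heapq.nsmallest` can select them without a full sort. The sketch below uses a made-up score dictionary, and `top_and_bottom` is a hypothetical helper, not part of the notebook.

```python
import heapq

# Made-up SO scores standing in for the real `scores` dictionary.
toy_scores = {"wonderful": 2.1, "fine": 0.4, "table": 0.0, "dull": -0.6, "terrible": -1.9}


def top_and_bottom(scores, k):
    """Return the k highest-scoring and k lowest-scoring (word, score) pairs."""
    most_positive = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    most_negative = heapq.nsmallest(k, scores.items(), key=lambda item: item[1])
    return most_positive, most_negative


most_positive, most_negative = top_and_bottom(toy_scores, 2)
for word, score in most_positive:
    print("%.3f\t%s" % (score, word))
print()
for word, score in most_negative:
    print("%.3f\t%s" % (score, word))
```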

#### Cell 10: Running Analysis on Common Crawl Vectors
This cell executes the analysis using the `common_crawl_vectors` and the original Turney & Littman positive/negative dictionaries. The output will show the words that the model considers most positive and most negative based on this dataset.

In [None]:
# Call the function to find and print the most positive and negative words from the Common Crawl vectors.
find_all_semantic_orientation(common_crawl_vectors, turney_littman_positive, turney_littman_negative)

#### Cell 11: Running Analysis on Twitter Vectors
This cell performs the same analysis but uses the `twitter_vectors`. Comparing the results from this run with the previous one can reveal how the source of the training data (Common Crawl vs. Twitter) affects the learned semantic associations.

In [None]:
# Call the function again, this time using the Twitter vectors.
find_all_semantic_orientation(twitter_vectors, turney_littman_positive, turney_littman_negative)
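
One simple way to quantify how much the two runs agree is the Jaccard overlap between their top-ranked word lists. This is only a sketch: the word lists below are invented, and `jaccard_overlap` is a hypothetical helper. In practice the lists would have to be collected from the function's results rather than printed.

```python
def jaccard_overlap(first, second):
    """Size of the intersection divided by size of the union of two word collections."""
    first, second = set(first), set(second)
    if not first and not second:
        return 1.0  # Two empty collections are treated as identical.
    return len(first & second) / len(first | second)


# Invented top-ranked words, standing in for the output of each run.
toy_common_crawl_top = ["excellent", "superb", "great", "wonderful"]
toy_twitter_top = ["great", "awesome", "excellent", "lovely"]

overlap = jaccard_overlap(toy_common_crawl_top, toy_twitter_top)
print("Jaccard overlap: %.3f" % overlap)
```

A value near 1 would indicate that both corpora rank nearly the same words at the top; a value near 0 would indicate that they surface largely different vocabulary.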

#### Cell 12: Question 2
This markdown cell presents the second task. The user is asked to create their own custom positive and negative dictionaries (with at least 10 words each) relevant to a specific binary classification problem they have worked on. The goal is to use the `find_all_semantic_orientation` function to discover new, related terms for their dictionaries.

Q2: In homework 4.classification/FeatureExploration, we created two dictionaries to use for binary text classification. For the binary classification problem you have been working on, create two new dictionaries containing at least 10 terms each that you think will help in discriminating between the two classes.

Execute this method on that pair of dictionaries to discover new terms to fill it.  How many of the top 10 terms are you able to select as appropriate for the dictionaries?

#### Cell 13: Placeholder for Custom Dictionaries
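These are placeholder sets, each containing only an empty string, which the user should replace with their own positive and negative words for the task described in Question 2. The empty string is not a vocabulary word, so the next cell will fail until the placeholders are filled in.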

In [None]:
# Placeholder for the user's custom positive dictionary; replace "" with at least 10 in-vocabulary terms.
positive_class_dictionary=set([""])
# Placeholder for the user's custom negative dictionary; replace "" with at least 10 in-vocabulary terms.
negative_class_dictionary=set([""])
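
Because the vectors cover only the 50K most frequent words, a custom term may be missing from the vocabulary, and `KeyedVectors.similarity` raises a `KeyError` for unknown words. A minimal sketch of checking the dictionaries first is shown below. `split_by_vocabulary` is a hypothetical helper, and a plain set stands in for the embeddings here; a `KeyedVectors` object should also work as the `vocabulary` argument, since it supports the `in` operator.

```python
def split_by_vocabulary(terms, vocabulary):
    """Split terms into those present in the vocabulary and those missing from it."""
    known = {term for term in terms if term in vocabulary}
    missing = set(terms) - known
    return known, missing


# Made-up vocabulary and candidate terms for illustration.
toy_vocabulary = {"refund", "broken", "love", "perfect"}
candidate_terms = {"love", "perfect", "flawless", ""}

known, missing = split_by_vocabulary(candidate_terms, toy_vocabulary)
print("usable:", sorted(known))
print("missing:", sorted(missing))
```

Passing only the `known` terms to `find_all_semantic_orientation` avoids the error, and the `missing` set shows which terms need a synonym or a different spelling.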

#### Cell 14: Running Analysis with Custom Dictionaries
This cell calls the analysis function using the Common Crawl vectors but with the user-defined custom dictionaries. This will generate lists of the most positive and negative words according to the user's specific domain, which can be used to expand the initial dictionaries.

In [None]:
# Execute the SO analysis using the Common Crawl vectors and the user's custom dictionaries.
find_all_semantic_orientation(common_crawl_vectors, positive_class_dictionary, negative_class_dictionary)