<a href="https://colab.research.google.com/github/darpan02-cypher/Knowledge-Data-and-Discovery/blob/main/Word_association_minning_HW6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Specify the path to your folder in Google Drive
folder_path = '/content/drive/MyDrive/KDD/MovieReviews' # Replace with the actual path to your folder

# List files in the folder
try:
    files = os.listdir(folder_path)
    print("Files found in the folder:")

except FileNotFoundError:
    print(f"Error: Folder not found at {folder_path}")
except Exception as e:
    print(f"An error occurred: {e}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Files found in the folder:


# Task
Calculate the entropy of the word "director" over its distribution across the 180 documents located in the folder "/content/drive/MyDrive/KDD/MovieReviews".

## Install necessary libraries

### Subtask:
Install `PyShortTextCategorization`, `scipy`, and `numpy`.


In [13]:
!pip install -U git+https://github.com/stephenhky/PyShortTextCategorization

Collecting git+https://github.com/stephenhky/PyShortTextCategorization
  Cloning https://github.com/stephenhky/PyShortTextCategorization to /tmp/pip-req-build-115vn1gn
  Running command git clone --filter=blob:none --quiet https://github.com/stephenhky/PyShortTextCategorization /tmp/pip-req-build-115vn1gn
  Resolved https://github.com/stephenhky/PyShortTextCategorization to commit 281f86e4b19f311d70a541179894815290f24c5a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [14]:
!pip install scipy numpy



## Load documents

### Subtask:
Load the documents from the specified folder in Google Drive.


In [15]:
import os

folder_path = '/content/drive/MyDrive/KDD/MovieReviews'
documents = []

try:
    files = os.listdir(folder_path)
    print(f"Files found in the folder: {len(files)}")

    for file_name in files:
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'r', encoding='latin-1') as f:
            content = f.read()
            documents.append(content)

    print(f"Successfully loaded {len(documents)} documents.")

except FileNotFoundError:
    print(f"Error: Folder not found at {folder_path}")
except Exception as e:
    print(f"An error occurred: {e}")

Files found in the folder: 180
Successfully loaded 180 documents.
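
Note that `os.listdir` returns entries in arbitrary order, so any later step that relies on document position (for example, assigning labels by index) depends on the order in which the files happened to be read. The following is a minimal sketch of an order-stable loader, assuming the file names sort into the intended order. `load_documents_sorted` is a hypothetical helper and is not part of the original notebook; the demo uses a temporary folder rather than Google Drive.

```python
import os
import tempfile


def load_documents_sorted(folder_path, encoding="latin-1"):
    """Read every regular file in folder_path, in sorted file-name order."""
    file_names = sorted(
        name for name in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, name))
    )
    texts = []
    for name in file_names:
        with open(os.path.join(folder_path, name), "r", encoding=encoding) as f:
            texts.append(f.read())
    return file_names, texts


# Tiny self-contained demo: files are created out of order on purpose.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("b.txt", "second"), ("a.txt", "first")]:
        with open(os.path.join(tmp, name), "w", encoding="latin-1") as f:
            f.write(text)
    demo_names, demo_texts = load_documents_sorted(tmp)

print(demo_names, demo_texts)
```

Returning the file names alongside the texts makes it possible to check that positions and labels line up.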


## Count word occurrences

### Subtask:
Count the occurrences of the word "director" in all the documents.


**Reasoning**:
Count the occurrences of the word "director" in each document and sum them up.



In [16]:
director_count = 0
for doc in documents:
    director_count += doc.lower().count("director")

print(f"Total occurrences of 'director': {director_count}")

Total occurrences of 'director': 292
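
Note that `str.count` matches substrings, so the total above also includes longer words that contain "director". A small sketch on a made-up sentence (not taken from the corpus) shows the difference between substring counting and whole-word counting with a word-boundary regular expression:

```python
import re

# Made-up sample sentence for illustration only.
sample = "The director praised the directors and the directorial debut."

# str.count matches substrings, so "directors" and "directorial" are included.
substring_count = sample.lower().count("director")

# A word-boundary regex counts only the standalone token "director".
whole_word_count = len(re.findall(r"\bdirector\b", sample.lower()))

print(substring_count, whole_word_count)
```

Which definition is appropriate depends on whether inflected forms should count as occurrences of the word.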


## Calculate probability distribution

### Subtask:
Calculate the probability distribution of the word "director" across the documents.


**Reasoning**:
As a first attempt, divide the total count of "director" by the total number of documents and take the complement. Note that this ratio is the average number of occurrences per document, not a probability, so it can exceed 1 (as the output below shows).



In [17]:
total_documents = len(documents)
p = director_count / total_documents
one_minus_p = 1 - p

print(f"Total documents: {total_documents}")
print(f"Probability of 'director' occurring (p): {p}")
print(f"Probability of 'director' not occurring (1-p): {one_minus_p}")

Total documents: 180
Probability of 'director' occurring (p): 1.6222222222222222
Probability of 'director' not occurring (1-p): -0.6222222222222222


## Calculate probability distribution

### Subtask:
Recalculate the probability distribution of the word "director" as the fraction of documents that contain it at least once.


In [18]:
import numpy as np

director_probabilities = []
for doc in documents:
    if "director" in doc.lower():
        director_probabilities.append(1)
    else:
        director_probabilities.append(0)

director_probabilities = np.array(director_probabilities)
p = np.mean(director_probabilities)
one_minus_p = 1 - p

print(f"Probability of 'director' occurring (p): {p}")
print(f"Probability of 'director' not occurring (1-p): {one_minus_p}")

Probability of 'director' occurring (p): 0.9388888888888889
Probability of 'director' not occurring (1-p): 0.061111111111111116


## Calculate entropy

### Subtask:
Compute the entropy of the word "director" using the calculated probability distribution.


In [19]:
from scipy.stats import entropy

probabilities = [p, one_minus_p]
entropy_value = entropy(probabilities, base=2)

# Keep the result in its own variable so the imported `entropy` function is not shadowed.
print(f"Entropy of 'director': {entropy_value}")

Entropy of 'director': 0.3318399155702335
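
As a cross-check on the `scipy.stats.entropy` result, the binary entropy H(p) = -p log2(p) - (1 - p) log2(1 - p) can be evaluated directly. This is a minimal sketch: `binary_entropy` is a hypothetical helper, and the fraction 169/180 is inferred from the printed p of about 0.9389 over 180 documents rather than read from the notebook's variables.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; 0 * log(0) is treated as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Assumption: 169 of the 180 documents contain the word (169 / 180 = 0.93888...).
p_director = 169 / 180
print(binary_entropy(p_director))
```

The function peaks at 1 bit when p = 0.5 and falls to 0 as p approaches 0 or 1, which is why a word present in almost every document has low entropy.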


## Summary:

### Data Analysis Key Findings

*   180 documents were successfully loaded from the path `/content/drive/MyDrive/KDD/MovieReviews` using 'latin-1' encoding.
*   The word "director" (case-insensitive) appeared a total of 292 times across all documents.
*   The probability of the word "director" occurring in a document is approximately 0.939.
*   The probability of the word "director" not occurring in a document is approximately 0.061.
*   The calculated entropy of the word "director" over its distribution across the documents is approximately 0.3318.

### Insights or Next Steps

*   The initial attempt to calculate the probability distribution based on total occurrences led to an incorrect probability greater than 1, highlighting the importance of correctly defining the event for probability calculation (occurrence in a document vs. total count).
*   The low entropy value suggests that the word "director" is relatively predictable in this corpus, frequently appearing in documents.
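
The distinction between the two calculations can be illustrated on a toy corpus. This sketch uses three made-up documents (not from the movie-review corpus) to contrast the mean number of occurrences per document, which is unbounded, with the document frequency, which is a valid probability:

```python
# Made-up documents for illustration only.
toy_documents = [
    "the director, the assistant director, and the director of photography",
    "a director cameo",
    "no mention here",
]

total_occurrences = sum(doc.count("director") for doc in toy_documents)
documents_with_word = sum("director" in doc for doc in toy_documents)

# Average occurrences per document: a rate, which can exceed 1.
mean_occurrences_per_document = total_occurrences / len(toy_documents)

# Fraction of documents containing the word: always between 0 and 1.
document_frequency = documents_with_word / len(toy_documents)

print(mean_occurrences_per_document, document_frequency)
```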


# Mutual information between the word "director" and the document author

To calculate the mutual information between the word "director" and the document author, we need the following:

1.  A list of documents.
2.  A list of authors, where each author in the list corresponds to the author of the document at the same index in the document list.

Assuming we have these two lists, we can use the following code to calculate the mutual information.

In [22]:
import numpy as np
from collections import defaultdict


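# Based on the given information, the first 80 reviews are by Berardinelli and the next 100 by Schwartz.
# Note: this labeling assumes `documents` is in that same order; os.listdir() returns entries in arbitrary order.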
document_authors = ["Berardinelli"] * 80 + ["Schwartz"] * 100

# Ensure the number of documents and authors are the same
if len(documents) != len(document_authors):
    raise ValueError("The number of documents and authors must be the same.")

# Create a joint probability distribution of (word_occurrence, author)
joint_prob_dist = defaultdict(int)
word_occurrences = [1 if "director" in doc.lower() else 0 for doc in documents]

for word_occurrence, author in zip(word_occurrences, document_authors):
    joint_prob_dist[(word_occurrence, author)] += 1

# Normalize the joint probability distribution
total_pairs = len(documents)
for key in joint_prob_dist:
    joint_prob_dist[key] /= total_pairs

# Calculate marginal probability distribution of word occurrence
p_word = {0: 0, 1: 0}
for word_occurrence in word_occurrences:
    p_word[word_occurrence] += 1
for key in p_word:
    p_word[key] /= total_pairs

# Calculate marginal probability distribution of author
p_author = defaultdict(int)
for author in document_authors:
    p_author[author] += 1
for key in p_author:
    p_author[key] /= total_pairs

# Calculate mutual information
mutual_information = 0
for (word_occurrence, author), p_xy in joint_prob_dist.items():
    if p_xy > 0:
        p_x = p_word[word_occurrence]
        p_y = p_author[author]
        if p_x > 0 and p_y > 0:
            mutual_information += p_xy * np.log2(p_xy / (p_x * p_y))

print(f"Mutual Information between 'director' and document author: {mutual_information}")

Mutual Information between 'director' and document author: 0.0751048474130945
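
The same quantity can also be computed from a table of joint counts, which makes the formula I(X; Y) = sum over (x, y) of p(x, y) log2( p(x, y) / (p(x) p(y)) ) easier to check by hand. The sketch below is an illustrative alternative, not the notebook's method; `mutual_information_from_counts` is a hypothetical helper and both tables contain made-up counts.

```python
import numpy as np


def mutual_information_from_counts(counts):
    """Mutual information in bits from a 2-D table of joint counts."""
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()
    p_rows = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_cols = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    expected = p_rows * p_cols                  # joint distribution under independence
    nonzero = joint > 0                         # cells with p(x, y) = 0 contribute nothing
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / expected[nonzero])))


# Rows: word absent / present. Columns: author 1 / author 2. Counts are made up.
independent = [[10, 20], [30, 60]]
perfectly_dependent = [[40, 0], [0, 60]]

mi_independent = mutual_information_from_counts(independent)
mi_dependent = mutual_information_from_counts(perfectly_dependent)
print(mi_independent, mi_dependent)
```

When the rows are proportional to each other the result is zero (up to floating-point error); when each row maps to exactly one column the result equals the entropy of the column variable.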


# Find the top ten words with the highest mutual information with the document author and their respective mutual information. Explain (in Python comments) what it means for a word to have a high mutual information with the document author.


# Task
Find the top ten words with the highest mutual information with the document author and their respective mutual information using the corpus of 180 movie reviews, where the first 80 reviews are by Berardinelli and the remaining 100 are by Schwartz. Explain what it means for a word to have a high mutual information with the document author.

## Tokenize documents

### Subtask:
Split the documents into individual words.


**Reasoning**:
The subtask requires tokenizing the documents into individual words. I will use `nltk` for this purpose, which requires downloading the 'punkt' tokenizer data first. Then I will iterate through the documents, convert them to lowercase, tokenize them, and store the tokens in a list.



In [23]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'punkt' tokenizer data if not already present.
# nltk.data.find raises LookupError when the resource is missing.
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

tokenized_documents = []
for doc in documents:
    # Convert to lowercase and tokenize
    tokens = word_tokenize(doc.lower())
    tokenized_documents.append(tokens)

print(f"Successfully tokenized {len(tokenized_documents)} documents.")

Successfully tokenized 180 documents.
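
`word_tokenize` keeps punctuation marks as separate tokens, so symbols are treated as "words" by any downstream vocabulary-level calculation. If only alphabetic words are wanted, one option is to filter the tokens. The sketch below uses a plain regular expression in place of NLTK so that it runs without any downloads; `alphabetic_tokens` is a hypothetical helper, the sample sentence is made up, and its output will differ from `word_tokenize` on contractions and hyphenated words.

```python
import re


def alphabetic_tokens(text):
    """Lowercase the text and return runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


# Made-up sample sentence for illustration only.
sample_tokens = alphabetic_tokens("The director's cut -- a bold, 2-hour film; worth it.")
print(sample_tokens)
```

Note that this splits "director's" into "director" and "s"; whether that is acceptable depends on the analysis.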


## Calculate mutual information for each word

### Subtask:
For each unique word, calculate its mutual information with the document author.


**Reasoning**:
Calculate the mutual information for each unique word across all documents.



In [24]:
from collections import defaultdict
import numpy as np
from scipy.stats import entropy

# Create a set of all unique words across all tokenized_documents.
all_words = set()
for doc_tokens in tokenized_documents:
    all_words.update(doc_tokens)

# Initialize an empty dictionary to store the mutual information for each word.
word_mutual_information = {}

# Calculate marginal probability distribution of author (reused from previous calculation)
p_author = defaultdict(int)
for author in document_authors:
    p_author[author] += 1
total_pairs = len(documents)
for key in p_author:
    p_author[key] /= total_pairs

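# Precompute a token set per document so each membership test is O(1) instead of a list scan.
document_token_sets = [set(doc_tokens) for doc_tokens in tokenized_documents]

# Iterate through each unique word.
for word in all_words:
    # For each word, create a list indicating its occurrence (1 if present, 0 if not) in each document.
    word_occurrences = [1 if word in token_set else 0 for token_set in document_token_sets]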

    # Calculate the joint probability distribution of the word occurrence and the document author.
    joint_prob_dist = defaultdict(int)
    for word_occurrence, author in zip(word_occurrences, document_authors):
        joint_prob_dist[(word_occurrence, author)] += 1

    # Normalize the joint probability distribution
    for key in joint_prob_dist:
        joint_prob_dist[key] /= total_pairs

    # Calculate marginal probability distribution of word occurrence
    p_word = {0: 0, 1: 0}
    for word_occurrence in word_occurrences:
        p_word[word_occurrence] += 1
    for key in p_word:
        p_word[key] /= total_pairs

    # Compute the mutual information for the current word
    mutual_information = 0
    for (word_occurrence, author), p_xy in joint_prob_dist.items():
        if p_xy > 0:
            p_x = p_word[word_occurrence]
            p_y = p_author[author]
            if p_x > 0 and p_y > 0:
                mutual_information += p_xy * np.log2(p_xy / (p_x * p_y))

    # Store the mutual information in the dictionary.
    word_mutual_information[word] = mutual_information

print(f"Calculated mutual information for {len(word_mutual_information)} unique words.")

Calculated mutual information for 15760 unique words.
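
For larger vocabularies, the per-word scan over all documents can be replaced by a single counting pass. The sketch below is one possible restructuring under the same presence/absence definition, not the notebook's implementation; `word_author_mutual_information` is a hypothetical helper and the toy inputs are made up.

```python
import math
from collections import Counter


def word_author_mutual_information(token_lists, authors):
    """Return {word: MI in bits} between word presence and author label."""
    n_docs = len(token_lists)
    author_totals = Counter(authors)
    # docs_with[(word, author)] = number of that author's documents containing the word
    docs_with = Counter()
    for tokens, author in zip(token_lists, authors):
        for word in set(tokens):
            docs_with[(word, author)] += 1

    vocabulary = {word for word, _ in docs_with}
    scores = {}
    for word in vocabulary:
        present_total = sum(docs_with[(word, a)] for a in author_totals)
        mi = 0.0
        for author, author_count in author_totals.items():
            present = docs_with[(word, author)]
            # Two cells per author: word present and word absent.
            for joint_count, word_count in (
                (present, present_total),
                (author_count - present, n_docs - present_total),
            ):
                if joint_count > 0:
                    p_xy = joint_count / n_docs
                    p_x = word_count / n_docs
                    p_y = author_count / n_docs
                    mi += p_xy * math.log2(p_xy / (p_x * p_y))
        scores[word] = mi
    return scores


# Made-up toy corpus: "alpha" and "beta" each appear for one author only.
toy_tokens = [["alpha", "shared"], ["alpha", "shared"], ["beta", "shared"], ["beta"]]
toy_authors = ["A", "A", "B", "B"]
toy_scores = word_author_mutual_information(toy_tokens, toy_authors)
print(toy_scores)
```

In the toy corpus, the author-exclusive words reach the maximum of 1 bit, while the word that appears for both authors scores lower.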


## Rank words by mutual information

### Subtask:
Sort the words based on their calculated mutual information in descending order.


**Reasoning**:
Sort the word_mutual_information dictionary items by mutual information in descending order and store them in a new list.



In [25]:
# Sort the word_mutual_information dictionary by values (mutual information) in descending order
sorted_word_mutual_information = sorted(word_mutual_information.items(), key=lambda item: item[1], reverse=True)

print(f"Successfully sorted words by mutual information.")

Successfully sorted words by mutual information.
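
When only the top few entries are needed, a full sort is not required. As an optional alternative, `heapq.nlargest` from the standard library returns the k highest-scoring items directly. The scores below are made up for illustration:

```python
import heapq

# Made-up scores for illustration only.
toy_scores = {"alpha": 0.91, "beta": 0.12, "gamma": 0.55, "delta": 0.30}

# nlargest returns the k highest-scoring (word, score) pairs, best first.
top_two = heapq.nlargest(2, toy_scores.items(), key=lambda item: item[1])
print(top_two)
```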


## Select top ten words

### Subtask:
Get the top ten words with the highest mutual information.


**Reasoning**:
Select the first 10 elements from the sorted list and store them in a new variable.



In [26]:
top_ten_words = sorted_word_mutual_information[:10]

print("Top ten words with the highest mutual information with document author:")
for word, mi in top_ten_words:
    print(f"Word: '{word}', Mutual Information: {mi}")

# Explain the meaning of high mutual information
print("\nExplanation:")
print("A word having high mutual information with the document author indicates that the word's presence or absence in a document is strongly associated with a particular author.")
print("In other words, the word is more likely to appear in documents written by one author compared to the other.")
print("This suggests that the word is a distinguishing feature of that author's writing style or the topics they tend to cover.")

Top ten words with the highest mutual information with document author:
Word: 'schwartz', Mutual Information: 0.991076059838222
Word: 'rights', Mutual Information: 0.9031044927571723
Word: 'reserved', Mutual Information: 0.869351835367191
Word: 'dennis', Mutual Information: 0.8062750920498906
Word: 'reviewed', Mutual Information: 0.7221798550488772
Word: '©', Mutual Information: 0.6302403387276296
Word: 'reviews', Mutual Information: 0.5237955538568037
Word: 'cast', Mutual Information: 0.49374267516494463
Word: ';', Mutual Information: 0.47696373366128547
Word: '--', Mutual Information: 0.4713735184568902

Explanation:
A word having high mutual information with the document author indicates that the word's presence or absence in a document is strongly associated with a particular author.
In other words, the word is more likely to appear in documents written by one author compared to the other.
This suggests that the word is a distinguishing feature of that author's writing style or the topics they tend to cover.

## Display results and explanation

### Subtask:
Print the top ten words and their mutual information values, and explain what high mutual information means in this context.


**Reasoning**:
Print the top ten words and their mutual information values, and explain what high mutual information means in this context.



In [27]:
# Print the top ten words and their mutual information
print("Top ten words with the highest mutual information with document author:")
for word, mi in top_ten_words:
    print(f"Word: '{word}', Mutual Information: {mi:.4f}")

# Explain the meaning of high mutual information
print("\nExplanation:")
print("A word having high mutual information with the document author indicates that the word's presence or absence in a document is strongly associated with a particular author.")
print("In other words, the word is more likely to appear in documents written by one author compared to the other.")
print("This suggests that the word is a distinguishing feature of that author's writing style or the topics they tend to cover.")

Top ten words with the highest mutual information with document author:
Word: 'schwartz', Mutual Information: 0.9911
Word: 'rights', Mutual Information: 0.9031
Word: 'reserved', Mutual Information: 0.8694
Word: 'dennis', Mutual Information: 0.8063
Word: 'reviewed', Mutual Information: 0.7222
Word: '©', Mutual Information: 0.6302
Word: 'reviews', Mutual Information: 0.5238
Word: 'cast', Mutual Information: 0.4937
Word: ';', Mutual Information: 0.4770
Word: '--', Mutual Information: 0.4714

Explanation:
A word having high mutual information with the document author indicates that the word's presence or absence in a document is strongly associated with a particular author.
In other words, the word is more likely to appear in documents written by one author compared to the other.
This suggests that the word is a distinguishing feature of that author's writing style or the topics they tend to cover.


## Summary:

### Q&A
What does it mean for a word to have a high mutual information with the document author?
A word having high mutual information with the document author indicates that the word's presence or absence in a document is strongly associated with a particular author. This suggests that the word is a distinguishing feature of that author's writing style or the topics they tend to cover, as the word is more likely to appear in documents written by one author compared to the other.
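
Mutual information measures the strength of the association but not its direction. To see which author a word favors, one can compare the fraction of each author's documents that contain it. This is a minimal sketch with made-up toy data; `presence_rate_by_author` is a hypothetical helper.

```python
from collections import defaultdict


def presence_rate_by_author(token_lists, authors, word):
    """Fraction of each author's documents that contain `word`."""
    containing = defaultdict(int)
    totals = defaultdict(int)
    for tokens, author in zip(token_lists, authors):
        totals[author] += 1
        if word in tokens:
            containing[author] += 1
    return {author: containing[author] / totals[author] for author in totals}


# Made-up toy corpus for illustration only.
toy_tokens = [["signature", "film"], ["signature"], ["film"], ["plot"]]
toy_authors = ["A", "A", "B", "B"]

signature_rates = presence_rate_by_author(toy_tokens, toy_authors, "signature")
film_rates = presence_rate_by_author(toy_tokens, toy_authors, "film")
print(signature_rates, film_rates)
```

A word whose rates differ sharply between authors has high mutual information; a word with similar rates for both has mutual information close to zero.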

### Data Analysis Key Findings
* The top ten words with the highest mutual information with the document author are: 'schwartz' (0.9911), 'rights' (0.9031), 'reserved' (0.8694), 'dennis' (0.8063), 'reviewed' (0.7222), '©' (0.6302), 'reviews' (0.5238), 'cast' (0.4937), ';' (0.4770), and '--' (0.4714).
* The word 'schwartz' has the highest mutual information, indicating a strong association with the author Schwartz.
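* Words like 'rights', 'reserved', and 'dennis' also show high mutual information, suggesting they are highly indicative of a specific author. These tokens look like byline or copyright boilerplate rather than stylistic vocabulary, and the mutual information score alone does not indicate which author a token favors.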
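
To put these scores in context, the mutual information between any word and the author can never exceed the entropy of the author variable itself. A short sketch using the 80/100 split stated in the task computes that upper bound:

```python
import math

# Author split as stated in the task: 80 reviews by Berardinelli, 100 by Schwartz.
author_counts = {"Berardinelli": 80, "Schwartz": 100}
total = sum(author_counts.values())

# H(author) in bits; I(word; author) can never exceed this value.
author_entropy = -sum(
    (count / total) * math.log2(count / total) for count in author_counts.values()
)
print(f"{author_entropy:.4f}")
```

The bound is about 0.9911 bits, which matches the score reported for 'schwartz' to four decimals. Under the labeling used here, that means the presence or absence of this token almost entirely determines the author.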

