<a href="https://colab.research.google.com/github/fengfrankgthb/Demonstrations/blob/main/LIT_CeO2_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SAT & GRE tested w/ all-mpnet-base-v2 Model 2025.05.18

In this project, a well-trained **transformer model** stands in for a knowledgeable super-student facing SAT and GRE reading questions. The tests disclose **biases** and **confusions** that are likely attributable to **over-fitting** during the model's training. Over-fitting is the ML term for **test cramming**: a phenomenon in which fitting the specifics of the **training dataset** leaves the model unable to fit the specifics of the **testing dataset**.
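As a rough illustration of the cramming idea, the toy sketch below compares a model that only memorizes its training pairs with one that learns a simple rule. The data, the `Crammer` and `RuleLearner` classes, and the `accuracy` helper are all made up for this illustration and have nothing to do with how all-mpnet-base-v2 was actually trained.

```python
# Toy illustration of over-fitting ("test cramming"); all data here is invented.
train = [(1, 2), (2, 4), (3, 6), (4, 8)]   # (x, y) pairs that follow y = 2x
test = [(5, 10), (6, 12)]                  # unseen pairs from the same rule


class Crammer:
    """Memorizes the training pairs and answers 0 for anything unseen."""

    def fit(self, pairs):
        self.table = dict(pairs)
        return self

    def predict(self, x):
        return self.table.get(x, 0)


class RuleLearner:
    """Learns a single slope by least squares through the origin."""

    def fit(self, pairs):
        self.slope = sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)
        return self

    def predict(self, x):
        return self.slope * x


def accuracy(model, pairs):
    """Fraction of pairs whose prediction matches the target."""
    hits = sum(1 for x, y in pairs if abs(model.predict(x) - y) < 1e-9)
    return hits / len(pairs)


crammer = Crammer().fit(train)
rule = RuleLearner().fit(train)

print("crammer train/test:", accuracy(crammer, train), accuracy(crammer, test))
print("rule    train/test:", accuracy(rule, train), accuracy(rule, test))
```

The memorizing model is perfect on the pairs it has seen and fails on the new ones, which is the pattern the term over-fitting describes.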

## 1. Install Necessary Libraries
* **sentence-transformers**: the text embedding library
* **scikit-learn**: the machine learning library
* **matplotlib**: the MATLAB-style plotting library

In [34]:
!pip install sentence-transformers scikit-learn matplotlib



## 2. Import Necessary Modules from the Libraries



**mpl_toolkits.mplot3d**: Matplotlib's 3D plotting toolkit

**numpy**: Numerical Python, the fundamental Python library for numerical computing

**Axes3D**: the 3D plotting class

**PCA**: Principal Component Analysis, for linear dimension reduction

**TSNE**: t-SNE, for non-linear dimension reduction that creates a more scattered effect

In [35]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

# Set matplotlib to inline (static) mode; 'notebook' selects interactive mode.
# Inline is already the default, but be explicit to avoid any confusion.
# The interactive notebook mode is often unstable in the Colab environment.
# An alternative is to keep 'inline' and use Plotly.
%matplotlib inline
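A minimal sketch of how the two reducers imported above can bring embeddings down to three dimensions for plotting. The random vectors are placeholders (an assumption) for real sentence embeddings; all-mpnet-base-v2 produces 768-dimensional vectors, and each example in this notebook has about nine text components. Note that recent scikit-learn versions reject a t-SNE perplexity that is not smaller than the number of samples, so the default of 30 is lowered here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder embeddings (assumption): 9 random 768-dimensional vectors standing
# in for SentenceTransformer("all-mpnet-base-v2").encode(sentences).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(9, 768))

# Linear reduction to 3 components for 3D plotting.
pca_coords = PCA(n_components=3).fit_transform(fake_embeddings)

# Non-linear reduction; perplexity is set below the sample count (9).
tsne = TSNE(n_components=3, perplexity=3, random_state=0)
tsne_coords = tsne.fit_transform(fake_embeddings)

print(pca_coords.shape, tsne_coords.shape)
```

Both results have one row per text component and three columns, ready to pass to an `Axes3D` scatter plot.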

## 3. Input Text Data

Choose one subsection below:

### 3.1 Example 1: CeO2-NPs, All-Mighty vs 5Vs (Section 8)

**118HHH Q1 on CeO2-NPs** is used for illustration, broken into 9 components:

* Pa = All Sentences combined in Passage
* P1 = 1st sentence in Passage
* P2 = 2nd sentence in Passage
* P3 = 3rd sentence in Passage
* Q? = the Question sentence
* Ax = wrong choice A
* Bv = correct answer B
* Cx = wrong choice C
* Dx = wrong choice D

In [None]:
# Example 1: CeO2 Question
# This is a *** question used for direct comparison with the 5Vs in sections 8.1 and 8.2
sentences = ["Pa: Some fuel additives contain cerium oxide nanoparticles (CeO2-NPs), which can leach into waterways and soils via waste water. In a 2015 study Mael Garaud and colleagues found that CeO2-NPs can accumulate in the bodies of zebra mussels (Dreissena polymorpha). While bioaccumulation of manufactured nanoparticles may be inherently worrisome, it has been hypothesized that CeO2-NPs bioaccumulation in invertebrate like D. polymorpha could serve a valuable proxy role, observing the need for manufacturers to conduct costly and intrusive sampling of vertebrate species--such as rainbow trout (Oncorhynchus mykiss), commonly used in regulatory compliance testing--for manipulative bioaccumulation, as environmental protection laws currently require.",
    "P1: Some fuel additives contain cerium oxide nanoparticles (CeO2-NPs), which can leach into waterways and soils via waste water.",
    "P2: In a 2015 study Mael Garaud and colleagues found that CeO2-NPs can accumulate in the bodies of zebra mussels (Dreissena polymorpha).",
    "P3: While bioaccumulation of manufactured nanoparticles may be inherently worrisome, it has been hypothesized that CeO2-NPs bioaccumulation in invertebrate like D. polymorpha could serve a valuable proxy role, observing the need for manufacturers to conduct costly and intrusive sampling of vertebrate species--such as rainbow trout (Oncorhynchus mykiss), commonly used in regulatory compliance testing--for manipulative bioaccumulation, as environmental protection laws currently require.",
    "Q?: Which finding, if true, would most directly weaken the hypothesis presented in the text?",
    "Ax) When D. polymorpha and O. mykiss are exposed to similar levels of CeO2-NPs, concentrations of CeO2-NPs in animals of both species show little variation from individual to individual.",
    "Bv) The rate of CeO2-NPs uptake in D. polymorpha differs from the rate of CeO2-NPs uptake in O. mykiss in a way that is not yet well understood by researchers.",
    "Cx) D.polymorpha has been shown to accumulate several other types of manufactured nanoparticles in addition to CeO2-NPs, whereas O. mykiss has been shown to accumulate only CeO2-NPs.",
    "Dx) Compared with O. mykiss, D.polymorpha can accumulate detectable CeO2-NPs concentrations with significantly fewer negative effects."
]
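The label prefixes used in these lists (`Pa:`, `P1:`, `Q?:`, `Ax)`, `Bv)`, and so on) can be separated from the text when labels are needed for plotting or for checking which choice is marked correct. The helper names `split_label` and `correct_choice` below are hypothetical and not part of the original notebook; this is only a sketch that assumes every entry starts with a two-character label followed by `:` or `)`.

```python
import re

# Hypothetical helpers (not from the original notebook).
# A label is two characters (e.g. "Pa", "P1", "Q?", "Bv") followed by ":" or ")".
LABEL_RE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9?])\s*[:)]\s*(.*)$", re.DOTALL)


def split_label(entry):
    """Return (label, text) for one entry of a `sentences` list."""
    match = LABEL_RE.match(entry)
    if match is None:
        raise ValueError(f"no label prefix found in: {entry[:30]!r}")
    return match.group(1), match.group(2)


def correct_choice(entries):
    """Return the label of the single answer choice marked with 'v'."""
    labels = [split_label(entry)[0] for entry in entries]
    marked = [lab for lab in labels if lab[0] in "ABCDE" and lab[1] == "v"]
    if len(marked) != 1:
        raise ValueError(f"expected one marked choice, found {marked}")
    return marked[0]


# Shortened stand-in entries that follow the same labeling convention.
demo = [
    "Pa: Whole passage.",
    "P1: First sentence.",
    "Q?: The question?",
    "Ax) A wrong choice.",
    "Bv) The correct choice.",
]

print([split_label(entry)[0] for entry in demo])
print(correct_choice(demo))
```

Calling `correct_choice(sentences)` on any of the example lists in this section would return whichever choice that list marks with `v`.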

### 3.2 Examples 2-3-4-5 & Alternatives:

* **Queen Egypt**: demonstrates **omis bias**.
* **LNH Study**: demonstrates **perif bias**, **long-comp confusion**, **num-cont confusion** (V3), and **mul-rel confusion** (V4). In the end, **num-cont confusion** (V3) and **mul-rel confusion** (V4) of the 5Vs are combined to demonstrate the clear solution effect.
* **Cotton Mather**: a GRE reading question that demonstrates **long-pass confusion** and **spec-confusion** in **graduate-level** texts.

In [None]:
# Example 2: Queen Egypt
# This is a ** question used to discover omis bias among choices
sentences = ["Pa: Archaeologist Christiana Kohler and her team excavated the Egyptian tomb of Queen Merneith, the wife of a First Dynasty pharaoh. Some scholars claim that she also ruled Egypt on her own and was actually the first female pharaoh. The team found a tablet in Merneith’s tomb with writing suggesting that she was in charge of the country’s treasury and other central offices. Whether Merneith was a pharaoh or not, this discovery supports the idea that Merneith likely _______",
    "P1: Archaeologist Christiana Kohler and her team excavated the Egyptian tomb of Queen Merneith, the wife of a First Dynasty pharaoh.",
    "P2: Some scholars claim that she also ruled Egypt on her own and was actually the first female pharaoh.",
    "P3: The team found a tablet in Merneith’s tomb with writing suggesting that she was in charge of the country’s treasury and other central offices.",
    "Q?: Whether Merneith was a pharaoh or not, this discovery supports the idea that Merneith likely _______ (choose from Av, Bx, Cx, and Dx). ",
    "Av) had an important role in Egypt’s government.",
    "Bx) lived after rather than before the First Dynasty of Egypt.",
    "Cx) traveled beyond Egypt’s borders often.",
    "Dx) created a new form of writing in Egypt."
]


In [None]:
# Example 2 (Alt-1): Queen Egypt amended
# This is the revised Example 2 to solve omis bias among choices.
sentences = ["Pa: Archaeologist Christiana Kohler and her team excavated the Egyptian tomb of Queen Merneith, the wife of a First Dynasty pharaoh. Some scholars claim that she also ruled Egypt on her own and was actually the first female pharaoh. The team found a tablet in Merneith’s tomb with writing suggesting that she was in charge of the country’s treasury and other central offices. Whether Merneith was a pharaoh or not, this discovery supports the idea that Merneith likely _______",
    "P1: Archaeologist Christiana Kohler and her team excavated the Egyptian tomb of Queen Merneith, the wife of a First Dynasty pharaoh.",
    "P2: Some scholars claim that she also ruled Egypt on her own and was actually the first female pharaoh.",
    "P3: The team found a tablet in Merneith’s tomb with writing suggesting that she was in charge of the country’s treasury and other central offices.",
    "Q?: This discovery supports that _______ (choose from Av, Bx, Cx, and Dx)?",
    "Av) Merneith likely had a role in her government",
    "Bx) Merneith likely lived after the First Dynasty",
    "Cx) Merneith likely traveld beyond Egypt borders",
    "Dx) Merneith likely created an Egptian writing."
]


In [None]:
# Example 3: LNH Study
# This is a *** question used to discover perif bias, long-comp confusion, num-cont confusion, and mul-rel confusion in text.
sentences = ["Pa: The linguistic niche hypothesis (LNH) posits that the exotericity of languages (how prevalent non-native speakers are) and grammatical complexity are inversely related, which the LNH ascribes to attrition of complex grammatical rules as more non-native speakers adopt the language but fail to acquire those rules. Focusing on two characteristics that are positive indices of grammatical complexity, fusion (when new phonemes arise from the merger of previously distinct ones) and informativity (languages’ capacity for meaningful variation), Olena Shcherbakova and colleagues conducted a quantitative analysis for more than 1,300 languages and claim the outcome is inconsistent with the LNH.",
    "P1: The linguistic niche hypothesis (LNH) posits that the exotericity of languages (how prevalent non-native speakers are) and grammatical complexity are inversely related, which the LNH ascribes to attrition of complex grammatical rules as more non-native speakers adopt the language but fail to acquire those rules.",
    "P2: Focusing on two characteristics that are positive indices of grammatical complexity, fusion (when new phonemes arise from the merger of previously distinct ones) and informativity (languages’ capacity for meaningful variation), Olena Shcherbakova and colleagues conducted a quantitative analysis for more than 1,300 languages and claim the outcome is inconsistent with the LNH.",
    "Pa: The linguistic niche hypothesis (LNH) posits that the exotericity of languages (how prevalent non-native speakers are) and grammatical complexity are inversely related, which the LNH ascribes to attrition of complex grammatical rules as more non-native speakers adopt the language but fail to acquire those rules. Focusing on two characteristics that are positive indices of grammatical complexity, fusion (when new phonemes arise from the merger of previously distinct ones) and informativity (languages’ capacity for meaningful variation), Olena Shcherbakova and colleagues conducted a quantitative analysis for more than 1,300 languages and claim the outcome is inconsistent with the LNH.",
    "Q?: Which finding, if true, would most directly support Shcherbakova and colleagues’ claim?",
    "Ax) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and fusion and between grammatical complexity and informativity",
    "Bx) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and exotericity.",
    "Cx) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between grammatical complexity and fusion.",
    "Dv) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between fusion and exotericity and between informativity and exotericity."
]

In [None]:
# Example 3 (Alt-1): LNH Study
# This is the revised Example 3 to solve perif bias in text.
sentences = ["Pa: The linguistic niche hypothesis (LNH) posits that the exotericity of languages (how prevalent non-native speakers are) and grammatical complexity are inversely related, which the LNH ascribes to attrition of complex grammatical rules as more non-native speakers adopt the language but fail to acquire those rules. Focusing on two characteristics that are positive indices of grammatical complexity, fusion (when new phonemes arise from the merger of previously distinct ones) and informativity (languages’ capacity for meaningful variation), Olena Shcherbakova and colleagues conducted a quantitative analysis for more than 1,300 languages and claim the outcome is inconsistent with the LNH.",
    "P1: The linguistic niche hypothesis (LNH) posits that the exotericity of languages (how prevalent non-native speakers are) and grammatical complexity are inversely related, which the LNH ascribes to attrition of complex grammatical rules as more non-native speakers adopt the language but fail to acquire those rules.",
    "P2: Focusing on two characteristics that are positive indices of grammatical complexity, fusion (when new phonemes arise from the merger of previously distinct ones) and informativity (languages’ capacity for meaningful variation), Olena Shcherbakova and colleagues conducted a quantitative analysis for more than 1,300 languages and claim the outcome is inconsistent with the LNH.",
    "Pa: The linguistic niche hypothesis (LNH) posits that the exotericity of languages (how prevalent non-native speakers are) and grammatical complexity are inversely related, which the LNH ascribes to attrition of complex grammatical rules as more non-native speakers adopt the language but fail to acquire those rules. Focusing on two characteristics that are positive indices of grammatical complexity, fusion (when new phonemes arise from the merger of previously distinct ones) and informativity (languages’ capacity for meaningful variation), Olena Shcherbakova and colleagues conducted a quantitative analysis for more than 1,300 languages and claim the outcome is inconsistent with the LNH.",
    "Q?: Which finding, if true, would most directly support Shcherbakova and colleagues’ claim?",
    "Ax) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and fusion and between grammatical complexity and informativity",
    "Bx) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and exotericity.",
    "Cx) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between grammatical complexity and fusion.",
    "Dv) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between grammatical fusion and exotericity and between grammatical informativity and exotericity."
]

In [None]:
# Example 3 (Alt-2): LNH Study
# This is the revised Example 3 to solve long-comp confusion in text.
sentences = ["Pa: The Linguistic Niche Hypothesis (LNH) suggests that as more non-native speakers adopt a language, its grammatical complexity decreases. This occurs because these speakers often do not fully acquire complex grammatical rules. Olena Shcherbakova and her colleagues focused on two measures of grammatical complexity: fusion (the merging of distinct phonemes) and informativity (the capacity for meaningful variation). They analyzed over 1,300 languages and found their results contradict the LNH.",
    "P1: The Linguistic Niche Hypothesis (LNH) suggests that as more non-native speakers adopt a language, its grammatical complexity decreases. This occurs because these speakers often do not fully acquire complex grammatical rules.",
    "P2: Olena Shcherbakova and her colleagues focused on two measures of grammatical complexity: fusion (the merging of distinct phonemes) and informativity (the capacity for meaningful variation).",
    "P3: They analyzed over 1,300 languages and found their results contradict the LNH.",
    "Q?: Which finding, if true, would most directly support Shcherbakova and colleagues’claim?",
    "Ax) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and fusion, as well as a slightly negative correlation between grammatical complexity and informativity.",
    "Bx) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and exotericity.",
    "Cx) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between grammatical complexity and fusion.",
    "Dv) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between grammatical fusion and exotericity, as well as a slightly positive correlation between grammatical informativity and exotericity."
]

In [None]:
# Example 3 (Alt-3): LNH Study
# This is the revised Example 3 to solve num-cont confusion in text.
sentences = ["Pa: The Linguistic Niche Hypothesis (LNH) suggests that as more non-native speakers adopt a language, its grammatical complexity decreases. This occurs because these speakers often do not fully acquire complex grammatical rules. Olena Shcherbakova and her colleagues focused on two measures of grammatical complexity: fusion (the merging of distinct phonemes) and informativity (the capacity for meaningful variation). They analyzed over 1,300 languages and found their results contradict the LNH.",
    "P1: The Linguistic Niche Hypothesis (LNH) suggests that as more non-native speakers adopt a language, its grammatical complexity decreases. This occurs because these speakers often do not fully acquire complex grammatical rules. Three initial factors.",
    "P2: Olena Shcherbakova and her colleagues focused on two measures of grammatical complexity: fusion (the merging of distinct phonemes) and informativity (the capacity for meaningful variation). Add two additional factors, and three factors are studied.",
    "P3: They analyzed over 1,300 languages and found their results contradict the LNH.",
    "Q?: Which finding, if true, would most directly support Shcherbakova and colleagues’claim?",
    "Ax) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and fusion, as well as a slightly negative correlation between grammatical complexity and informativity. Three factors.",
    "Bx) Shcherbakova and colleagues’ analysis showed a slightly negative correlation between grammatical complexity and exotericity. Two factors.",
    "Cx) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between grammatical complexity and fusion. Two factors.",
    "Dv) Shcherbakova and colleagues’ analysis showed a slightly positive correlation between grammatical fusion and exotericity, as well as a slightly positive correlation between grammatical informativity and exotericity. Three factors."
]

In [None]:
# Example 3 (Alt-4): LNH Study
# This is the revised Example 3 to solve mul-rel confusion in text by 5V.
sentences = ["Pa: LNH suggests that exotericity != complexity, and exotericity = non-native speakers. Olena Shcherbakova and her colleagues focused on two measures: 1) fusion = complexity, and 2) informativity = complexity. They analyzed over 1,300 languages and found their results contradict the LNH.",
    "P1: LNH suggests that exotericity != complexity, while exotericity = non-native speakers. Inital three factors ",
    "P2: Olena Shcherbakova and her colleagues focused on two measures: 1) fusion = complexity, and 2) informativity = complexity.",
    "P3: They analyzed over 1,300 languages and found their results contradict the LNH.",
    "Q?: Which finding, if true, would most directly support Shcherbakova and colleagues’claim?",
    "Ax) complexity != fusion, and complexity != informativity.",
    "Bx) complexity != exotericity. Two factors.",
    "Cx) complexity = fusion. Two factors.",
    "Dv) fusion = exotericity, and informativity = exotericity."
]

In [None]:
# Example 3 (Alt-5): LNH Study
# This is the revised Example 3 to solve mul-rel and num-cont confusions in text by 5V.
sentences = ["Pa: LNH suggests that exotericity != complexity, and exotericity = non-native speakers. Olena Shcherbakova and her colleagues focused on two measures: 1) fusion = complexity, and 2) informativity = complexity. They analyzed over 1,300 languages and found their results contradict the LNH.",
    "P1: LNH suggests that exotericity != complexity, while exotericity = non-native speakers. Inital three factors ",
    "P2: Olena Shcherbakova and her colleagues focused on two measures: 1) fusion = complexity, and 2) informativity = complexity. Three factors.",
    "P3: They analyzed over 1,300 languages and found their results contradict the LNH. Five factors in total, only three factors were studied.",
    "Q?: Which finding, if true, would most directly support Shcherbakova and colleagues’claim?",
    "Ax) complexity != fusion, and complexity != informativity. Three factors.",
    "Bx) complexity != exotericity. Two factors.",
    "Cx) complexity = fusion. Two factors.",
    "Dv) fusion = exotericity, and informativity = exotericity. Three factors."
]

In [24]:
# Example 4: Gabler Ulysses
# This is a hard question to discover long-comp confusion in text by 5V.
sentences = ["Pa: The many editions of James Joyce's 1922 novel Ulysses are not textually identical, and scholars debate which versions reflect Joyce's authorial intent. One no longer widely read edition is the 1984 'critical and synoptic edition' edited by Hans Walter Gabler, which followed French and German editorial theories rather than editorial traditions of the United States and United Kingdom and which was later found to have introduced errors due to Gabler's choice to consult facsimile manuscripts rather than using only originals. However, few Joyce scholars worldwide had expertise in such textual issues, and most of those who did worked on the edition with Gabler. So, it is unsurprising that initial scholarly reviews of the 1984 edition were mostly _______.",
    "P1: The many editions of James Joyce's 1922 novel Ulysses are not textually identical, and scholars debate which versions reflect Joyce's authorial intent.",
    "P2: One no longer widely read edition is the 1984 'critical and synoptic edition' edited by Hans Walter Gabler, which followed French and German editorial theories rather than editorial traditions of the United States and United Kingdom and which was later found to have introduced errors due to Gabler's choice to consult facsimile manuscripts rather than using only originals.",
    "P3: However, few Joyce scholars worldwide had expertise in such textual issues, and most of those who did worked on the edition with Gabler. So, it is unsurprising that initial scholarly reviews of the 1984 edition were mostly _______.",
    "Q?: Which choice most logically complete the text?",
    "Ax) negative, since those Joyce scholars with the necessary expertise to write a review of the 1984 edition would be aware that facsimile manuscripts cannot be produced with a high enough fidelity to the original to ensure that relying on them will not introduce editorial errors.",
    "Bx) positive, since scholars who reviewed the 1984 edition were unaffiliated with its production and were mostly either Joyce specialists who were largely unfamiliar with editorial theories and practices or specialists in such theories and practices who were insufficiently familiar with Joyce.",
    "Cx) negative, since any scholar with expertise in editorial theories of the United States and United Kingdom as well as French and German editorial theories most likely worked with Gabler on the 1984 edition and would therefore not review it.",
    "Dv) positive, since Ulysses is a novel in English and the 1984 edition would therefore be more widely reviewed in United States and United Kingdom publications than in French and German publications."
]

In [40]:
# Example 4 (Alt-1): Gabler Ulysses
# This is a hard question to discover irrelevance confusion in text by deleting V1(+/-).
sentences = ["Pa: The many editions of James Joyce's 1922 novel Ulysses are not textually identical, and scholars debate which versions reflect Joyce's authorial intent. One no longer widely read edition is the 1984 'critical and synoptic edition' edited by Hans Walter Gabler, which followed French and German editorial theories rather than editorial traditions of the United States and United Kingdom and which was later found to have introduced errors due to Gabler's choice to consult facsimile manuscripts rather than using only originals. However, few Joyce scholars worldwide had expertise in such textual issues, and most of those who did worked on the edition with Gabler. So, it is unsurprising that initial scholarly reviews of the 1984 edition were mostly _______.",
    "P1: The many editions of James Joyce's 1922 novel Ulysses are not textually identical, and scholars debate which versions reflect Joyce's authorial intent.",
    "P2: One no longer widely read edition is the 1984 'critical and synoptic edition' edited by Hans Walter Gabler, which followed French and German editorial theories rather than editorial traditions of the United States and United Kingdom and which was later found to have introduced errors due to Gabler's choice to consult facsimile manuscripts rather than using only originals.",
    "P3: However, few Joyce scholars worldwide had expertise in such textual issues, and most of those who did worked on the edition with Gabler. So, it is unsurprising that initial scholarly reviews of the 1984 edition were mostly _______.",
    "Q?: Which choice most logically complete the text?",
    "Ax) unknown, since those Joyce scholars with the necessary expertise to write a review of the 1984 edition would be aware that facsimile manuscripts cannot be produced with a high enough fidelity to the original to ensure that relying on them will not introduce editorial errors.",
    "Bx) unknown, since scholars who reviewed the 1984 edition were unaffiliated with its production and were mostly either Joyce specialists who were largely unfamiliar with editorial theories and practices or specialists in such theories and practices who were insufficiently familiar with Joyce.",
    "Cx) unknown, since any scholar with expertise in editorial theories of the United States and United Kingdom as well as French and German editorial theories most likely worked with Gabler on the 1984 edition and would therefore not review it.",
    "Dv) unknown, since Ulysses is a novel in English and the 1984 edition would therefore be more widely reviewed in United States and United Kingdom publications than in French and German publications."
]

In [45]:
# Example 4 (Alt-2): Gabler Ulysses
# This is a hard question to mitigate long-comp confusion in text by 5V.
sentences = ["Pa: The many editions of James Joyce's 1922 novel Ulysses are not textually identical, and scholars debate which versions reflect Joyce's authorial intent. One no longer widely read edition is the 1984 'critical and synoptic edition' edited by Hans Walter Gabler, which followed French and German editorial theories rather than editorial traditions of the United States and United Kingdom and which was later found to have introduced errors due to Gabler's choice to consult facsimile manuscripts rather than using only originals. However, few Joyce scholars worldwide had expertise in such textual issues, and most of those who did worked on the edition with Gabler. So, it is unsurprising that initial scholarly reviews of the 1984 edition were mostly _______.",
    "P1: The many editions of James Joyce's 1922 novel Ulysses are not textually identical, and scholars debate which versions reflect Joyce's authorial intent. One; Contrast; Many; Negative.",
    "P2: One no longer widely read edition is the 1984 'critical and synoptic edition' edited by Hans Walter Gabler, which followed French and German editorial theories rather than editorial traditions of the United States and United Kingdom and which was later found to have introduced errors due to Gabler's choice to consult facsimile manuscripts rather than using only originals. One; Two; Contrast; Negative. Effect",
    "P3: However, few Joyce scholars worldwide had expertise in such textual issues, and most of those who did worked on the edition with Gabler. So, it is unsurprising that initial scholarly reviews of the 1984 edition were mostly _______. One; Negative; Many.",
    "Q?: Which choice most logically complete the text?",
    "Ax) nsince those Joyce scholars with the necessary expertise to write a review of the 1984 edition would be aware that facsimile manuscripts cannot be produced with a high enough fidelity to the original to ensure that relying on them will not introduce editorial errors. Many; One; Negative; Positive. Effect",
    "Bx) since scholars who reviewed the 1984 edition were unaffiliated with its production and were mostly either Joyce specialists who were largely unfamiliar with editorial theories and practices or specialists in such theories and practices who were insufficiently familiar with Joyce. Many; One; Positive; Negative.",
    "Cx) since any scholar with expertise in editorial theories of the United States and United Kingdom as well as French and German editorial theories most likely worked with Gabler on the 1984 edition and would therefore not review it. Many; One; Two; Compare. Negative",
    "Dv) since Ulysses is a novel in English and the 1984 edition would therefore be more widely reviewed in United States and United Kingdom publications than in French and German publications. One; Two; Contrast. Positive."
]

In [20]:
# Example 4 (Alt-3): Gabler Ulysses
# This is a hard question to solve long-comp confusion in text by 5V.
sentences = ["Pa: The many editions of James Joyce's 1922 novel Ulysses are not textually identical, and scholars debate which versions reflect Joyce's authorial intent. One no longer widely read edition is the 1984 'critical and synoptic edition' edited by Hans Walter Gabler, which followed French and German editorial theories rather than editorial traditions of the United States and United Kingdom and which was later found to have introduced errors due to Gabler's choice to consult facsimile manuscripts rather than using only originals. However, few Joyce scholars worldwide had expertise in such textual issues, and most of those who did worked on the edition with Gabler. So, it is unsurprising that initial scholarly reviews of the 1984 edition were mostly _______.",
    "P1: One; Contrast; Many; Negative.",
    "P2: One; Negative Two; Contrast; Effect; Negative",
    "P3: One; Negative; Many.",
    "Q?: Which choice most logically match the text?",
    "Ax) Many; One; Negative; Positive. Effect",
    "Bx) Many; One; Positive; Negative.",
    "Cx) Many; One; Two; Compare. Negative",
    "Dv) One; Two; Contrast. Positive."
]

In [None]:
# Example 5: GRE Cotton Mather
# This is a hard GRE reading question used to discover long-pass bias, long-comp confusion, num-cont confusion, and mul-rel confusion in text.
sentences = ["Pa: Among many historians a belief persists that Cotton Mather's biographies of some of the settlers of the Massachusetts Bay Colony (published 1702) are exercises in hagiography, endowing their subjects with saintly piety at the expense of historical accuracy. Yet modern studies have profited both from the breadth of information that Mather provides in, for example, his discussions of colonial medicine and from his critical observations of such leading figures as Governor John Winthrop. Mather's wry humor as demonstrated by his detailed descriptions of events such as Winthrop's efforts to prevent wood-stealing is overlooked by those charging Mather with presenting his subjects as extremely pious. The charge also obscures Mather's concern with the settlers material, not just spiritual, prosperity. Further, this pejorative view underrates the biographies value as chronicles: Mather amassed all sorts of published and unpublished documents as sources, and his selection of key eventsshows a marked sensitivity to the nature of the colony's development.",
    "P1: Among many historians a belief persists that Cotton Mather's biographies of some of the settlers of the Massachusetts Bay Colony (published 1702) are exercises in hagiography, endowing their subjects with saintly piety at the expense of historical accuracy.",
    "P2: Yet modern studies have profited both from the breadth of information that Mather provides in, for example, his discussions of colonial medicine and from his critical observations of such leading figures as Governor John Winthrop.",
    "P3: Mather's wry humor as demonstrated by his detailed descriptions of events such as Winthrop's efforts to prevent wood-stealing is overlooked by those charging Mather with presenting his subjects as extremely pious.",
    "P4: The charge also obscures Mather's concern with the settlers' material, not just spiritual, prosperity. (P5:) Further, this pejorative view underrates the biographies' value as chronicles: Mather amassed all sorts of published and unpublished documents as sources, and his selection of key events shows a marked sensitivity to the nature of the colony's development.",
    "Av) The primary purpose of the passage is to argue against a theory universally accepted by historical researchers",
    "Bx) The primary purpose of the passage is to call attention to an unusual approach to documenting a historical era",
    "Cx) The primary purpose of the passage is to summarize research on a specific historical figure",
    "Dx) The primary purpose of the passage is to counter a particular view about the work of a biographer",
    "Ex) The primary purpose of the passage is to point out subtle differences among controversial historical reports"
]

In [None]:
# Example 5 (Alt-1): GRE Cotton Mather
# A revised version of Example 5 that addresses long-pass confusion and spec-confusion in graduate-level text.
sentences = ["Pa: Among many historians a belief persists that Cotton Mather's biographies of some of the settlers of the Massachusetts Bay Colony (published 1702) are exercises in hagiography, endowing their subjects with saintly piety at the expense of historical accuracy. Yet modern studies have profited both from the breadth of information that Mather provides in, for example, his discussions of colonial medicine and from his critical observations of such leading figures as Governor John Winthrop. Mather's wry humor as demonstrated by his detailed descriptions of events such as Winthrop's efforts to prevent wood-stealing is overlooked by those charging Mather with presenting his subjects as extremely pious. The charge also obscures Mather's concern with the settlers' material, not just spiritual, prosperity. Further, this pejorative view underrates the biographies' value as chronicles: Mather amassed all sorts of published and unpublished documents as sources, and his selection of key events shows a marked sensitivity to the nature of the colony's development.",
    "P1: Many believed that Cotton Mather's biographies of settlers of Massachusetts Bay Colony are exercises in hagiography.",
    "P2: Yet modern studies have profited both from information that Mather provides and from his critical observations of leading figures.",
    "P3: Mather's wry humor is overlooked by those charging Mather with presenting his subjects as extremely pious.",
    "P4: The charge also obscures Mather's concern with the settlers' material. (P5:) Further, this pejorative view underrates the biographies' value as chronicles.",
    "Av) The primary purpose of the passage is to argue against a theory universally accepted by historical researchers",
    "Bx) The primary purpose of the passage is to call attention to an unusual approach to documenting a historical era",
    "Cx) The primary purpose of the passage is to summarize research on a specific historical figure",
    "Dx) The primary purpose of the passage is to counter a particular view about the work of a biographer",
    "Ex) The primary purpose of the passage is to point out subtle differences among controversial historical reports"
]

In [None]:
# Example 5 (Alt-2): GRE Cotton Mather
# A revised version of Example 5 that mitigates spec-confusions in text by appending 5V tags.
sentences = ["Pa: Among many historians a belief persists that Cotton Mather's biographies of some of the settlers of the Massachusetts Bay Colony (published 1702) are exercises in hagiography, endowing their subjects with saintly piety at the expense of historical accuracy. Yet modern studies have profited both from the breadth of information that Mather provides in, for example, his discussions of colonial medicine and from his critical observations of such leading figures as Governor John Winthrop. Mather's wry humor as demonstrated by his detailed descriptions of events such as Winthrop's efforts to prevent wood-stealing is overlooked by those charging Mather with presenting his subjects as extremely pious. The charge also obscures Mather's concern with the settlers' material, not just spiritual, prosperity. Further, this pejorative view underrates the biographies' value as chronicles: Mather amassed all sorts of published and unpublished documents as sources, and his selection of key events shows a marked sensitivity to the nature of the colony's development.",
    "P1: Many believed that Cotton Mather's biographies of settlers of Massachusetts Bay Colony are exercises in hagiography. One; Subjective.",
    "P2: Yet modern studies have profited both from information that Mather provides and from his critical observations of leading figures. Two; Objective. Positive.",
    "P3: Mather's wry humor is overlooked by those charging Mather with presenting his subjects as extremely pious. One; Objective; Negative.",
    "P4: The charge also obscures Mather's concern with the settlers' material. One; Negative; Subjective. (P5:) Further, this pejorative view underrates the biographies' value as chronicles. One; Subjective; Negative.",
    "Ax) The primary purpose of the passage is to argue against a theory universally accepted by historical researchers. One; Many; Objective.",
    "Bx) The primary purpose of the passage is to call attention to an unusual approach to documenting a historical era. One; Many; Subjective.",
    "Cx) The primary purpose of the passage is to summarize research on a specific historical figure. One; Subjective.",
    "Dv) The primary purpose of the passage is to counter a particular view about the work of a biographer. One; Subjective; Negative",
    "Ex) The primary purpose of the passage is to point out subtle differences among controversial historical reports. Two; Negative; Objective; Contrast."
]

In [None]:
# Example 5 (Alt-3): GRE Cotton Mather
# A revised version of Example 5 that reduces every sentence to its 5V tags alone to resolve spec-confusions.
sentences = ["Pa: One; Subjective; Negative.",
    "P1: Many: One; Subjective.",
    "P2: Two; Objective. Positive.",
    "P3: One; Objective; Negative.",
    "P4: One; Negative; Subjective. One; Subjective; Negative.",
    "Ax) One; Many; Objective.",
    "Bx) One; Many; Subjective.",
    "Cx) One; Subjective.",
    "Dv) One; Subjective; Negative",
    "Ex) Two; Negative; Objective; Contrast."
]

## 4. Generate Embeddings for CeO2 Sample Sentences.

A pre-trained Sentence Transformer model, `all-mpnet-base-v2`, is used to generate an embedding for each sentence. This model is a good general-purpose choice: it maps sentences and paragraphs to a **768**-dimensional dense vector space and can be used for tasks like clustering or semantic search.

`all-mpnet-base-v2` is a sentence transformer model that converts sentences and paragraphs into numerical vectors. These vectors, called embeddings, capture the semantic meaning of the text.

**Key Features and What it's Used For**:
* Sentence Embeddings: The model takes a sentence (or short paragraph) as input and produces a dense vector representation.
* 768 Dimensions: Each embedding has 768 dimensions, a common size for models of this type.
* MPNet Architecture: MPNet combines the strengths of BERT and XLNet to better capture word order and context.
* General Purpose: `all-mpnet-base-v2` is designed as a general-purpose model and performs well on a variety of tasks.

**Common Applications**:
* Semantic Search: Finding sentences or documents with similar meanings.
* Information Retrieval: Pulling up relevant information based on a text query.
* Clustering: Grouping sentences or paragraphs with similar meanings.
* Sentence Similarity: Measuring how alike two pieces of text are.
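
Sentence similarity between embeddings is commonly measured with cosine similarity. The sketch below is a minimal illustration using NumPy only; the helper name `cosine_similarity_matrix` and the toy 3-dimensional vectors are assumptions standing in for real 768-dimensional embeddings.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of the row vectors."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so the dot product equals the cosine
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Toy stand-ins for real embeddings (assumed values, not model output)
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first row
    [0.0, 3.0, 0.0],  # orthogonal to the first row
])

similarity = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarity, 2))
```

With real data, the array returned by `model.encode(sentences)` could be passed in place of `toy_embeddings`; values near 1 would then indicate sentences the model considers close in meaning.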

**How to Use It** (in Python with Sentence Transformers):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('all-mpnet-base-v2')
    sentences = [
        "This is a simple example.",
        "Here is another sentence.",
        "A third sentence for demonstration."
    ]
    embeddings = model.encode(sentences)
    print(embeddings)  # Output: a NumPy array of shape (3, 768), one vector per sentence

**Why is it Popular?**
* Inclusive dataset: It was trained on a massive dataset of over 1 billion sentence pairs drawn from Natural Language Inference (NLI) datasets, paraphrase datasets, and a large collection of other published English data from various sources.
* Good Performance: It generally ranks high in accuracy for many semantic text understanding tasks.
* Efficiency: While very effective, it's also relatively efficient to use.
* Ease of Use: Libraries like Sentence Transformers make it very easy to download and use.

In [46]:
# Load the pre-trained model
model = SentenceTransformer('all-mpnet-base-v2')

# Generate the embeddings
embeddings = model.encode(sentences)

print("Shape of embeddings:", embeddings.shape)
print("Example embedding (sentence at index 8):\n", embeddings[8][:100]) # Print the first 100 dimensions only

Shape of embeddings: (9, 768)
Example embedding (sentence at index 8):
 [ 2.7918559e-04  3.8392488e-02 -1.1461860e-02  7.8423927e-03
 -7.8011863e-02 -2.2147162e-02  4.9371026e-02 -4.1693398e-03
  2.8112642e-02  3.3739156e-03 -2.8768408e-03 -8.8016093e-02
  3.6667313e-02  1.3735986e-02  1.8709023e-02  5.4249067e-02
  6.0190752e-02  1.9590722e-02 -3.3756670e-02 -1.9371241e-02
 -2.3903828e-02  3.2254808e-02  4.7508858e-02 -1.3700384e-02
  7.1731053e-02  4.8390910e-02  3.6884539e-02 -1.5454598e-03
 -2.3542551e-02  6.3554714e-03 -6.3347486e-03  2.5571546e-02
  4.8902514e-03  3.4401137e-02  1.3632779e-06 -1.4547161e-02
  2.6920917e-02 -1.2276147e-03  8.9691706e-02 -6.6200187e-03
  1.2338617e-01  1.7411795e-02  1.1350488e-02 -2.9012609e-02
  3.1468984e-02 -1.2626675e-02  1.0901851e-02  2.9353630e-02
 -6.8607859e-02 -4.5381254e-03  1.2959260e-02 -2.7123509e-02
  2.0032126e-02  8.4511070e-03  1.1911872e-02  1.6508201e-02
 -3.4251112e-02 -3.5486311e-02  3.0807922e-02  2.5018636e-02
  4.0178676e-03  1

## 5. Reduce Dimensionality: 2D with PCA

**PCA** = principal component analysis.

Principal components are composite dimensions, each a linear combination of the 768 dimensions of the embedding dataset. PCA reduces the number of dimensions while retaining as much of the variance as mathematically possible, so the data points stay as distinguishable as they can be. The ability to discern things is, loosely speaking, intelligence, or smartness.
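
To make "retaining the most variance" concrete, the sketch below computes PCA by hand with a singular value decomposition and reports the share of variance each component keeps. It is an illustration under assumptions: the helper `pca_svd` and the random toy data are hypothetical, and the notebook itself uses scikit-learn's `PCA`.

```python
import numpy as np

def pca_svd(data, n_components):
    """Project data onto its first principal components via SVD."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)  # PCA works on mean-centered data
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Squared singular values are proportional to the variance per component
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return projected, explained_ratio[:n_components]

# Toy data (assumed): 9 points in 6 dimensions, most variance in the first two
rng = np.random.default_rng(0)
toy = rng.normal(size=(9, 6)) * np.array([5.0, 3.0, 0.5, 0.2, 0.1, 0.1])

projected, ratio = pca_svd(toy, n_components=2)
print("Projected shape:", projected.shape)
print("Variance kept by 2 components:", ratio.sum())
```

With scikit-learn, the comparable numbers are available as `explained_variance_ratio_` on a fitted `PCA` object, which is a quick way to judge how much a 2D or 3D plot leaves out.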

In [None]:
# Reduce to 2 dimensions using PCA
pca_2d = PCA(n_components=2)
reduced_embeddings_2d = pca_2d.fit_transform(embeddings)

# Visualize the 2D embeddings
plt.figure(figsize=(4, 3))
plt.scatter(reduced_embeddings_2d[:, 0], reduced_embeddings_2d[:, 1])

# Annotate each point with the corresponding sentence
for i, txt in enumerate(sentences):
    plt.annotate(txt[:2], (reduced_embeddings_2d[i, 0], reduced_embeddings_2d[i, 1]))

plt.title("2D Visualization of Sentence Embeddings (PCA)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.grid(True)
plt.show()

## 6. Reduce Dimensionality: 3D with PCA

### 6.1 3D PCA static

In [None]:
# Reduce to 3 dimensions using PCA
pca_3d = PCA(n_components=3)
reduced_embeddings_3d = pca_3d.fit_transform(embeddings)

# Visualize the 3D embeddings
fig_3d = plt.figure(figsize=(8, 6))
ax_3d = fig_3d.add_subplot(111, projection='3d')
scatter_3d = ax_3d.scatter(reduced_embeddings_3d[:, 0], reduced_embeddings_3d[:, 1], reduced_embeddings_3d[:, 2])

# Annotate each point with the corresponding sentence
for i, txt in enumerate(sentences):
    ax_3d.text(reduced_embeddings_3d[i, 0], reduced_embeddings_3d[i, 1], reduced_embeddings_3d[i, 2], txt[:2])

ax_3d.set_xlabel("PC 1")
ax_3d.set_ylabel("PC 2")
ax_3d.set_zlabel("PC 3")
ax_3d.set_title("3D Visualization of Sentence Embeddings (PCA)")
plt.show()

### 6.2 3D PCA rotational with P-tet

The question domain is illustrated by the P-tet, the tetrahedron whose vertices are P1, P2, P3, and Q?. Each choice's relation to the prompt and question is shown by its distance from the P-tet.

In [None]:
import plotly.graph_objects as go
import numpy as np
from sklearn.decomposition import PCA

labels = [s[:2] for s in sentences]  # Get first two letters.

# Perform PCA to reduce to 3 dimensions
pca_3d = PCA(n_components=3)
reduced_embeddings_3d = pca_3d.fit_transform(embeddings)

# Directly define the indices, assuming P1, P2, P3, Q? are at indices 1, 2, 3, and 4
p1_index = 1
p2_index = 2
p3_index = 3
q_index = 4

# Extract the 3D coordinates of P1, P2, P3, and Q?
p1_coords = reduced_embeddings_3d[p1_index]
p2_coords = reduced_embeddings_3d[p2_index]
p3_coords = reduced_embeddings_3d[p3_index]
q_coords = reduced_embeddings_3d[q_index]

# Define the vertices of the tetrahedron
tetrahedron_vertices = np.array([p1_coords, p2_coords, p3_coords, q_coords])

# Define the edges of the tetrahedron (indices of vertices)
tetrahedron_edges = [
    (0, 1), (0, 2), (0, 3),  # Edges from P1 to P2, P3, Q?
    (1, 2), (1, 3),          # Edges from P2 to P3, Q?
    (2, 3)                   # Edge from P3 to Q?
]

# Create the lines for the tetrahedron edges
lines = []
for i, (start_index, end_index) in enumerate(tetrahedron_edges):
    start_point = tetrahedron_vertices[start_index]
    end_point = tetrahedron_vertices[end_index]
    lines.append(
        go.Scatter3d(
            x=[start_point[0], end_point[0]],
            y=[start_point[1], end_point[1]],
            z=[start_point[2], end_point[2]],
            mode='lines',
            line=dict(color='red', width=4),  # Style the lines
            name=f'Edge {i+1}'
        )
    )

# Create the 3D scatter plot and add the lines
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_embeddings_3d[:, 0],
    y=reduced_embeddings_3d[:, 1],
    z=reduced_embeddings_3d[:, 2],
    mode='markers+text',
    text=labels,  # Use first 2 characters of labels
    textposition="middle right",
    marker=dict(size=6),
    name='Data Points'
)] + lines)  # Combine scatter and lines

# Set the title and axis labels
fig.update_layout(
    title="3D PCA Visualization with Tetrahedron",
    scene=dict(
        xaxis_title="PC 1",
        yaxis_title="PC 2",
        zaxis_title="PC 3",
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    showlegend=True
)

# Show the plot
fig.show()
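
The distance from each choice to the P-tet can also be put into numbers rather than read off the plot. The sketch below uses the distance to the tetrahedron's centroid as a simple proxy; that proxy, the helper name `distances_to_centroid`, and the toy coordinates are assumptions, not the notebook's own metric.

```python
import numpy as np

def distances_to_centroid(points, vertex_indices, choice_indices):
    """Distance from each choice point to the centroid of the tetrahedron vertices."""
    points = np.asarray(points, dtype=float)
    centroid = points[list(vertex_indices)].mean(axis=0)
    return {i: float(np.linalg.norm(points[i] - centroid)) for i in choice_indices}

# Toy 3D coordinates (assumed) laid out like the notebook's sentence list
toy_points = np.array([
    [0.0, 0.0, 0.0],  # index 0: Pa (not a vertex)
    [1.0, 0.0, 0.0],  # index 1: P1
    [0.0, 1.0, 0.0],  # index 2: P2
    [0.0, 0.0, 1.0],  # index 3: P3
    [1.0, 1.0, 1.0],  # index 4: Q?
    [0.5, 0.5, 0.5],  # index 5: a choice sitting at the centroid
    [3.0, 3.0, 3.0],  # index 6: a choice far from the tetrahedron
])

choice_distances = distances_to_centroid(toy_points, [1, 2, 3, 4], [5, 6])
for index, distance in choice_distances.items():
    print(f"choice at index {index}: {distance:.3f}")
```

With the notebook's variables, a call such as `distances_to_centroid(reduced_embeddings_3d, [1, 2, 3, 4], range(5, len(sentences)))` would rank the choices, under the same index assumption the cell above makes.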



### 6.3 P-sphere, P-tet, and Scatterplot

In [47]:
# Sphere, tetrahedron, and scatterplot.
import plotly.graph_objects as go
import numpy as np
from sklearn.decomposition import PCA

labels = [s[:2] for s in sentences]  # Get first two letters.

# Perform PCA to reduce to 3 dimensions
pca_3d = PCA(n_components=3)
reduced_embeddings_3d = pca_3d.fit_transform(embeddings)

# Directly define the indices, assuming P1, P2, P3, Q? are at indices 1, 2, 3, and 4
p1_index = 1
p2_index = 2
p3_index = 3
q_index = 4

# Extract the 3D coordinates of P1, P2, P3, and Q?
p1_coords = reduced_embeddings_3d[p1_index]
p2_coords = reduced_embeddings_3d[p2_index]
p3_coords = reduced_embeddings_3d[p3_index]
q_coords = reduced_embeddings_3d[q_index]

# Define the vertices of the tetrahedron
tetrahedron_vertices = np.array([p1_coords, p2_coords, p3_coords, q_coords])

# Define the edges of the tetrahedron (indices of vertices)
tetrahedron_edges = [
    (0, 1), (0, 2), (0, 3),  # Edges from P1 to P2, P3, Q?
    (1, 2), (1, 3),          # Edges from P2 to P3, Q?
    (2, 3)                   # Edge from P3 to Q?
]

# Create the lines for the tetrahedron edges
lines = []
for i, (start_index, end_index) in enumerate(tetrahedron_edges):
    start_point = tetrahedron_vertices[start_index]
    end_point = tetrahedron_vertices[end_index]
    lines.append(
        go.Scatter3d(
            x=[start_point[0], end_point[0]],
            y=[start_point[1], end_point[1]],
            z=[start_point[2], end_point[2]],
            mode='lines',
            line=dict(color='red', width=4),  # Style the lines
            name=f'Edge {i+1}'
        )
    )

# --- Start of new code for sphere calculation ---
# Points for sphere calculation
P = [p1_coords, p2_coords, p3_coords, q_coords]
x = [p[0] for p in P]
y = [p[1] for p in P]
z = [p[2] for p in P]

# System of equations to find sphere center (cx, cy, cz)
# (xi-cx)^2 + (yi-cy)^2 + (zi-cz)^2 = R^2
# Subtracting equation for P1 from P2, P3, P4:
# 2(x2-x1)cx + 2(y2-y1)cy + 2(z2-z1)cz = (x2^2+y2^2+z2^2) - (x1^2+y1^2+z1^2)
# ... and so on for P3 and P4

A = np.array([
    [2*(x[1]-x[0]), 2*(y[1]-y[0]), 2*(z[1]-z[0])],
    [2*(x[2]-x[0]), 2*(y[2]-y[0]), 2*(z[2]-z[0])],
    [2*(x[3]-x[0]), 2*(y[3]-y[0]), 2*(z[3]-z[0])]
])

B = np.array([
    (x[1]**2 + y[1]**2 + z[1]**2) - (x[0]**2 + y[0]**2 + z[0]**2),
    (x[2]**2 + y[2]**2 + z[2]**2) - (x[0]**2 + y[0]**2 + z[0]**2),
    (x[3]**2 + y[3]**2 + z[3]**2) - (x[0]**2 + y[0]**2 + z[0]**2)
])

try:
    sphere_center = np.linalg.solve(A, B)
    cx, cy, cz = sphere_center[0], sphere_center[1], sphere_center[2]
    sphere_radius = np.sqrt((x[0]-cx)**2 + (y[0]-cy)**2 + (z[0]-cz)**2)

    # Generate sphere surface points
    u = np.linspace(0, 2 * np.pi, 50) # Azimuthal angle
    v = np.linspace(0, np.pi, 25)    # Polar angle

    sphere_x = cx + sphere_radius * np.outer(np.cos(u), np.sin(v))
    sphere_y = cy + sphere_radius * np.outer(np.sin(u), np.sin(v))
    sphere_z = cz + sphere_radius * np.outer(np.ones(np.size(u)), np.cos(v))

    sphere_surface = go.Surface(
        x=sphere_x, y=sphere_y, z=sphere_z,
        opacity=0.3,
        colorscale='Blues', # You can choose other colorscales e.g. 'Viridis', 'RdBu'
        showscale=False, # Hide the color scale bar for the sphere
        name='Sphere'
    )
    sphere_added = True
except np.linalg.LinAlgError:
    print("Could not determine the sphere: Points might be coplanar or collinear.")
    sphere_surface = None # No sphere to add if calculation fails
    sphere_added = False
# --- End of new code for sphere calculation ---

# Create the 3D scatter plot and add the lines and sphere
data_elements = [go.Scatter3d(
    x=reduced_embeddings_3d[:, 0],
    y=reduced_embeddings_3d[:, 1],
    z=reduced_embeddings_3d[:, 2],
    mode='markers+text',
    text=labels,  # Use first 2 characters of labels
    textposition="middle right",
    marker=dict(size=4),
    name='Data Points'
)] + lines

if sphere_added and sphere_surface is not None:
    data_elements.append(sphere_surface)

fig = go.Figure(data=data_elements)


# Set the title and axis labels
fig.update_layout(
    title="3D PCA Visualization with Tetrahedron and Sphere",
    scene=dict(
        xaxis_title="PC 1",
        yaxis_title="PC 2",
        zaxis_title="PC 3",
        # Aspect ratio to make the sphere look more like a sphere
        aspectmode='data' # 'auto', 'cube', 'data', 'manual'
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    showlegend=True
)

# Show the plot
fig.show()
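
The linear system used above for the sphere center can be checked in isolation. The sketch below wraps the same subtraction trick in a small function and verifies that all four points end up at the same distance from the computed center; the function name `circumsphere` and the test points are assumptions chosen so the answer is easy to see.

```python
import numpy as np

def circumsphere(p1, p2, p3, p4):
    """Center and radius of the sphere through four non-coplanar 3D points."""
    pts = np.array([p1, p2, p3, p4], dtype=float)
    # Subtracting the sphere equation of the first point from the others
    # removes R^2 and the quadratic terms in the center, leaving A @ c = b
    A = 2.0 * (pts[1:] - pts[0])
    b = np.sum(pts[1:] ** 2, axis=1) - np.sum(pts[0] ** 2)
    center = np.linalg.solve(A, b)  # raises LinAlgError if A is singular
    radius = float(np.linalg.norm(pts[0] - center))
    return center, radius

# Four points on the unit sphere around the origin (assumed test case)
test_points = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, 0, 1]]
center, radius = circumsphere(*test_points)

# Every defining point should lie at distance `radius` from the center
point_distances = np.linalg.norm(np.array(test_points, dtype=float) - center, axis=1)
print("center:", center, "radius:", radius)
print("distances:", point_distances)
```

Running the same check on `p1_coords`, `p2_coords`, `p3_coords`, and `q_coords` is a cheap way to confirm the plotted sphere really passes through the P-tet vertices.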

### 6.4 3D PCA from multiple angles static

In case the rotatable views above are not available in the current environment, the static plots below show the same 3D PCA from multiple angles.

In [None]:
# PCA from multiple angles static views
pca_3d = PCA(n_components=3)
reduced_embeddings_3d = pca_3d.fit_transform(embeddings)

# Create various angles for multiple 3D renderings
elevations = [30, 30, 0, -30]  # Angles of elevation
azim_angles = [0, 45, 90, 135] # Azimuthal angles

for i, (elev, azim) in enumerate(zip(elevations, azim_angles)):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(reduced_embeddings_3d[:, 0], reduced_embeddings_3d[:, 1], reduced_embeddings_3d[:, 2])
    for j, txt in enumerate(sentences):
        ax.text(reduced_embeddings_3d[j, 0], reduced_embeddings_3d[j, 1], reduced_embeddings_3d[j, 2], txt[:2])
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_zlabel("PC 3")
    ax.view_init(elev=elev, azim=azim)
    plt.title(f"3D Plot (Elev={elev}, Azim={azim})")
    plt.show()

## 7. Using t-SNE for Dimensionality Reduction

**t-SNE** = t-distributed stochastic neighbor embedding. Unlike **PCA**, which is linear, **t-SNE** is a non-linear method that tends to display the embeddings in a more scattered fashion.
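
The "stochastic neighbor" part of the name refers to turning distances into neighbor probabilities: nearby points get a high probability of being neighbors, distant points a probability near zero. The sketch below shows only that first step, with one fixed Gaussian bandwidth `sigma`; this is a simplification, since the real algorithm tunes a bandwidth per point to match the chosen perplexity. The function name and toy points are assumptions.

```python
import numpy as np

def neighbor_probabilities(points, sigma=1.0):
    """Row i holds the probability that point i picks each other point as its neighbor."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    squared_distances = np.sum(diff ** 2, axis=2)
    affinity = np.exp(-squared_distances / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)  # a point is never its own neighbor
    return affinity / affinity.sum(axis=1, keepdims=True)

# Two points close together and one far away (assumed toy data)
toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
probs = neighbor_probabilities(toy)
print(np.round(probs, 3))
```

t-SNE then searches for low-dimensional positions whose neighbor probabilities match these as closely as possible, which is why distances between far-apart clusters in a t-SNE plot should be read with caution.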


### 7.1 2D t-SNE

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

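# Reduce to 2 dimensions using t-SNE with a lower perplexity
# Note: scikit-learn 1.5+ renames `n_iter` to `max_iter`; switch the keyword if `n_iter` is rejected
tsne_2d = TSNE(n_components=2, random_state=42, n_iter=300, perplexity=min(5, len(sentences) - 1)) # Perplexity must stay below n_samples, so cap it at 5 or n_samples - 1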
reduced_embeddings_tsne_2d = tsne_2d.fit_transform(embeddings)

# Visualize the 2D embeddings (t-SNE)
plt.figure(figsize=(8, 6))
plt.scatter(reduced_embeddings_tsne_2d[:, 0], reduced_embeddings_tsne_2d[:, 1])

# Annotate each point with the corresponding sentence
for i, txt in enumerate(sentences):
    plt.annotate(txt[:2], (reduced_embeddings_tsne_2d[i, 0], reduced_embeddings_tsne_2d[i, 1]))

plt.title("2D Visualization of Sentence Embeddings (t-SNE)")
plt.xlabel("t-SNE D1")
plt.ylabel("t-SNE D2")
plt.grid(True)
plt.show()

### 7.2 3D t-SNE static

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

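# Reduce to 3 dimensions using t-SNE with a lower perplexity
# Note: scikit-learn 1.5+ renames `n_iter` to `max_iter`; switch the keyword if `n_iter` is rejected
tsne_3d = TSNE(n_components=3, random_state=42, n_iter=300, perplexity=min(5, len(sentences) - 1))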
reduced_embeddings_tsne_3d = tsne_3d.fit_transform(embeddings)

# Visualize the 3D embeddings
fig_3d = plt.figure(figsize=(4, 3))
ax_3d = fig_3d.add_subplot(111, projection='3d')
scatter_3d = ax_3d.scatter(reduced_embeddings_tsne_3d[:, 0], reduced_embeddings_tsne_3d[:, 1], reduced_embeddings_tsne_3d[:, 2])

# Annotate each point with the first two characters of the sentence
for i, txt in enumerate(sentences):
    ax_3d.text(reduced_embeddings_tsne_3d[i, 0], reduced_embeddings_tsne_3d[i, 1], reduced_embeddings_tsne_3d[i, 2], txt[:2])

ax_3d.set_xlabel("t-SNE D1")
ax_3d.set_ylabel("t-SNE D2")
ax_3d.set_zlabel("t-SNE D3")
ax_3d.set_title("3D Visualization of Sentence Embeddings (t-SNE)")
plt.show()

### 7.3 3D t-SNE rotational

In [None]:
import plotly.graph_objects as go
import numpy as np
from sklearn.manifold import TSNE

# Uses the `embeddings` and `sentences` generated in the cells above
labels = [s[:2] for s in sentences]  # Get first two letters.

# Reduce to 3 dimensions using t-SNE (same settings as in 7.2)
# Note: scikit-learn 1.5+ renames `n_iter` to `max_iter`; switch the keyword if `n_iter` is rejected
tsne_3d = TSNE(n_components=3, random_state=42, n_iter=300, perplexity=min(5, len(sentences) - 1))
reduced_embeddings_tsne_3d = tsne_3d.fit_transform(embeddings)

# Directly define the indices, assuming P1, P2, P3, Q? are at indices 1, 2, 3, and 4
p1_index = 1
p2_index = 2
p3_index = 3
q_index = 4

# Extract the 3D coordinates of P1, P2, P3, and Q?
p1_coords = reduced_embeddings_tsne_3d[p1_index]
p2_coords = reduced_embeddings_tsne_3d[p2_index]
p3_coords = reduced_embeddings_tsne_3d[p3_index]
q_coords = reduced_embeddings_tsne_3d[q_index]

# Define the vertices of the tetrahedron
tetrahedron_vertices = np.array([p1_coords, p2_coords, p3_coords, q_coords])

# Define the edges of the tetrahedron (indices of vertices)
tetrahedron_edges = [
    (0, 1), (0, 2), (0, 3),  # Edges from P1 to P2, P3, Q?
    (1, 2), (1, 3),          # Edges from P2 to P3, Q?
    (2, 3)                   # Edge from P3 to Q?
]

# Create the lines for the tetrahedron edges
lines = []
for i, (start_index, end_index) in enumerate(tetrahedron_edges):
    start_point = tetrahedron_vertices[start_index]
    end_point = tetrahedron_vertices[end_index]
    lines.append(
        go.Scatter3d(
            x=[start_point[0], end_point[0]],
            y=[start_point[1], end_point[1]],
            z=[start_point[2], end_point[2]],
            mode='lines',
            line=dict(color='red', width=4),  # Style the lines
            name=f'Edge {i+1}'
        )
    )

# Create the 3D scatter plot and add the lines
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_embeddings_tsne_3d[:, 0],
    y=reduced_embeddings_tsne_3d[:, 1],
    z=reduced_embeddings_tsne_3d[:, 2],
    mode='markers+text',
    text=labels,  # Use first 2 characters of labels
    textposition="middle right",
    marker=dict(size=8),
    name='Data Points'
)] + lines)  # Combine scatter and lines

# Set the title and axis labels
fig.update_layout(
    title="3D t-SNE Visualization with Tetrahedron",
    scene=dict(
        xaxis_title="t-SNE D1",
        yaxis_title="t-SNE D2",
        zaxis_title="t-SNE D3",
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    showlegend=True
)

# Show the plot
fig.show()


## 8. How about 5-Vs?

### 8.1 Initialized 5V (CL 1wk + DR 1wk)

Below is a typical perception after initial training. While errors exist across V2/V3/V4/V5, the student can vaguely distinguish the correct answer Bv from the wrong choices Ax, Cx, and Dx.

| 5Vs | Pa | P1 | P2 | P3 | Q? | Ax | Bv | Cx | Dx |
|-----|----|----|----|----|----|----|----|----|----|
| V1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| V2 | 1 | 0 | 0.5 | 1 | 0 | 0 | 0.5 | 0 | 0 |
| V3 | 1 | 0 | 0.5 | 1 | 1 | 1 | 1 | 1 | 1 |
| V4 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0.5 | 1 |
| V5 | 0 | 1 | 0 | 0.5 | 0 | 0 | 0 | 0 | 1 |
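
One way to check that the correct answer stands out is to measure how far each choice sits from the question in the 5V space. The sketch below copies the Q?, Ax, Bv, Cx, and Dx columns of the table and compares Euclidean distances; using distance to Q? as the criterion is an assumption made for illustration, not a method stated in this notebook.

```python
import numpy as np

# 5V scores (V1..V5) for the question and the four choices, copied from the table
five_v = {
    "Q?": [1, 0, 1, 1, 0],
    "Ax": [0, 0, 1, 0, 0],
    "Bv": [1, 0.5, 1, 1, 0],
    "Cx": [0, 0, 1, 0.5, 0],
    "Dx": [1, 0, 1, 1, 1],
}

question = np.array(five_v["Q?"], dtype=float)

# Euclidean distance from each choice to the question in 5V space
distances = {
    label: float(np.linalg.norm(np.array(scores, dtype=float) - question))
    for label, scores in five_v.items()
    if label != "Q?"
}

nearest = min(distances, key=distances.get)
for label, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{label}: {distance:.3f}")
print("Nearest choice to Q?:", nearest)  # Bv
```

Bv differs from Q? only by 0.5 on V2, so it is the nearest choice, while Ax, Cx, and Dx each differ by a full point on at least one V.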

In [None]:
# Scatterplot and P-Tet: 5V-ini
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.decomposition import PCA

# Your data as a string (including header row)
data_string = """datapoint,V1,V2,V3,V4,V5
Pa,0,1,1,0,0
P1,0,0,0,1,1
P2,0,0.5,0.5,0,0
P3,0,1,1,0,0.5
Q?,1,0,1,1,0
Ax,0,0,1,0,0
Bv,1,0.5,1,1,0
Cx,0,0,1,0.5,0
Dx,1,0,1,1,1
"""

# Load the data from the string into a Pandas DataFrame
from io import StringIO
df = pd.read_csv(StringIO(data_string))

# Extract the data for PCA (exclude the 'datapoint' column)
X = df.iloc[:, 1:].values  # Get values from columns V1 to V5

# Extract the datapoint labels
labels = df['datapoint'].tolist()

# Perform PCA
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(X)

# Get the indices of P1, P2, P3, and Q?
p1_index = labels.index('P1')
p2_index = labels.index('P2')
p3_index = labels.index('P3')
q_index = labels.index('Q?')

# Extract the 3D coordinates of P1, P2, P3, and Q?
p1_coords = reduced_data[p1_index]
p2_coords = reduced_data[p2_index]
p3_coords = reduced_data[p3_index]
q_coords = reduced_data[q_index]

# Define the vertices of the tetrahedron
tetrahedron_vertices = np.array([p1_coords, p2_coords, p3_coords, q_coords])

# Define the edges of the tetrahedron (indices of vertices)
tetrahedron_edges = [
    (0, 1), (0, 2), (0, 3),  # Edges from P1 to P2, P3, Q?
    (1, 2), (1, 3),          # Edges from P2 to P3, Q?
    (2, 3)                   # Edge from P3 to Q?
]

# Create the 3D scatter plot for all points
scatter = go.Scatter3d(
    x=reduced_data[:, 0],
    y=reduced_data[:, 1],
    z=reduced_data[:, 2],
    mode='markers+text',
    text=labels,
    textposition="middle right",
    marker=dict(size=4),
    name='Data Points'  # Add a name for the scatter plot
)

# Create the lines for the tetrahedron edges
lines = []
for i, (start_index, end_index) in enumerate(tetrahedron_edges):
    start_point = tetrahedron_vertices[start_index]
    end_point = tetrahedron_vertices[end_index]
    lines.append(
        go.Scatter3d(
            x=[start_point[0], end_point[0]],
            y=[start_point[1], end_point[1]],
            z=[start_point[2], end_point[2]],
            mode='lines',
            line=dict(color='red', width=4),  # Style the lines
            name=f'Edge {i+1}' # Add a name for each line.
        )
    )

# Combine scatter and lines
data = [scatter] + lines

# Set the title and axis labels
layout = go.Layout(
    title="3D PCA of Data with Tetrahedron",
    scene=dict(
        xaxis_title="PC 1",
        yaxis_title="PC 2",
        zaxis_title="PC 3",
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    showlegend=True #show legend
)

# Create the figure
fig = go.Figure(data=data, layout=layout)

# Show the plot (this will display an interactive plot in Colab)
fig.show()


In [None]:
# Scatterplot, P-Tet, and P-sphere: 5V-ini
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.decomposition import PCA
from io import StringIO

# Your data as a string (including header row)
data_string = """datapoint,V1,V2,V3,V4,V5
Pa,0,1,1,0,0
P1,0,0,0,1,1
P2,0,0.5,0.5,0,0
P3,0,1,1,0,0.5
Q?,1,0,1,1,0
Ax,0,0,1,0,0
Bv,1,0.5,1,1,0
Cx,0,0,1,0.5,0
Dx,1,0,1,1,1
"""

# Load the data from the string into a Pandas DataFrame
df = pd.read_csv(StringIO(data_string))

# Extract the data for PCA (exclude the 'datapoint' column)
X = df.iloc[:, 1:].values  # Get values from columns V1 to V5

# Extract the datapoint labels
labels = df['datapoint'].tolist()

# Perform PCA
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(X)

# Get the indices of P1, P2, P3, and Q?
p1_index = labels.index('P1')
p2_index = labels.index('P2')
p3_index = labels.index('P3')
q_index = labels.index('Q?')

# Extract the 3D coordinates of P1, P2, P3, and Q?
p1_coords = reduced_data[p1_index]
p2_coords = reduced_data[p2_index]
p3_coords = reduced_data[p3_index]
q_coords = reduced_data[q_index]

# Define the vertices of the tetrahedron (also the points defining the sphere)
sphere_defining_points = [p1_coords, p2_coords, p3_coords, q_coords]
tetrahedron_vertices = np.array(sphere_defining_points)

# --- Functions to calculate sphere center and radius ---
def get_sphere_coeffs(p1, p2, p3, p4):
    """
    Calculates the coefficients A, B, C, D of the sphere equation:
    x^2 + y^2 + z^2 + Ax + By + Cz + D = 0
    passing through four points p1, p2, p3, p4.
    """
    points = np.array([p1, p2, p3, p4])
    # Form matrix M for the system M * [A, B, C, D]' = -[x^2+y^2+z^2]'
    M = np.ones((4, 4))
    M[:, 0] = points[:, 0]  # x coordinates
    M[:, 1] = points[:, 1]  # y coordinates
    M[:, 2] = points[:, 2]  # z coordinates

    # Right hand side vector
    rhs = -(points[:, 0]**2 + points[:, 1]**2 + points[:, 2]**2)

    try:
        # Solve for A, B, C, D
        coeffs = np.linalg.solve(M, rhs)
        return coeffs
    except np.linalg.LinAlgError:
        print("Error: The four points might be coplanar. Cannot determine a unique sphere.")
        return None

def calculate_sphere_center_radius(p1, p2, p3, p4):
    """
    Calculates the center and radius of a sphere passing through four points.
    p1, p2, p3, p4 are 3D points as NumPy arrays or lists [x, y, z].
    Returns (center, radius) or (None, None) if points are coplanar.
    """
    coeffs = get_sphere_coeffs(np.array(p1), np.array(p2), np.array(p3), np.array(p4))

    if coeffs is None:
        return None, None

    A, B, C, D_coeff = coeffs

    # Center (xc, yc, zc)
    xc = -A / 2
    yc = -B / 2
    zc = -C / 2
    center = np.array([xc, yc, zc])

    # Radius R
    # R^2 = xc^2 + yc^2 + zc^2 - D_coeff
    radius_sq = xc**2 + yc**2 + zc**2 - D_coeff
    if radius_sq < 0:
        print("Error: Calculated radius squared is negative (points might be collinear or an issue with PCA reduction).")
        return None, None
    radius = np.sqrt(radius_sq)

    return center, radius

# --- Function to generate Plotly sphere surface ---
def get_plotly_sphere_surface(center, radius, color='rgba(0,180,255,0.3)', resolution=50, name='Circumsphere'):
    """
    Generates Plotly go.Surface data for a sphere.
    center: NumPy array or list for the sphere's center [xc, yc, zc].
    radius: Radius of the sphere.
    color: Color of the sphere as an rgba string (e.g., 'rgba(R,G,B,A)').
    resolution: Number of points for theta and phi.
    name: Legend name for the sphere.
    """
    theta = np.linspace(0, 2 * np.pi, resolution)
    phi = np.linspace(0, np.pi, resolution)
    theta, phi = np.meshgrid(theta, phi)

    x = center[0] + radius * np.cos(theta) * np.sin(phi)
    y = center[1] + radius * np.sin(theta) * np.sin(phi)
    z = center[2] + radius * np.cos(phi)

    # To achieve a single color with opacity, we set the color directly in the colorscale
    # and ensure opacity is handled by the color string itself or the opacity property of go.Surface

    # Extract RGB from the rgba string if provided, default to a blue if format is unexpected
    try:
        rgb_color_part = color.split('(')[1].split(')')[0].split(',')
        r, g, b = rgb_color_part[0], rgb_color_part[1], rgb_color_part[2]
        plotly_color = f'rgb({r},{g},{b})'
        opacity_val = float(rgb_color_part[3]) if len(rgb_color_part) > 3 else 0.3
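    except (AttributeError, IndexError, ValueError):
        # Fall back to the default when the string is not in 'rgba(R,G,B,A)' form
        plotly_color = 'rgb(0,180,255)' # Default blue
        opacity_val = 0.3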

    return go.Surface(
        x=x, y=y, z=z,
        colorscale=[[0, plotly_color], [1, plotly_color]], # Solid color
        showscale=False,
        opacity=opacity_val,
        name=name,
        hoverinfo='skip' # Optional: disable hover for the sphere surface
    )

# --- Plotting ---

# Create the 3D scatter plot for all points
scatter = go.Scatter3d(
    x=reduced_data[:, 0],
    y=reduced_data[:, 1],
    z=reduced_data[:, 2],
    mode='markers+text',
    text=labels,
    textposition="middle right",
    marker=dict(size=4, color='black'), # Black markers stay visible against the sphere
    name='Data Points'
)

# Define the edges of the tetrahedron
tetrahedron_edges_indices = [
    (0, 1), (0, 2), (0, 3),  # Edges from P1
    (1, 2), (1, 3),          # Edges from P2
    (2, 3)                   # Edge from P3
]

# Create the lines for the tetrahedron edges
lines = []
for i, (start_idx, end_idx) in enumerate(tetrahedron_edges_indices):
    start_point = tetrahedron_vertices[start_idx]
    end_point = tetrahedron_vertices[end_idx]
    lines.append(
        go.Scatter3d(
            x=[start_point[0], end_point[0]],
            y=[start_point[1], end_point[1]],
            z=[start_point[2], end_point[2]],
            mode='lines',
            line=dict(color='red', width=3),
            name='Tetrahedron Edge' # Same name for every edge trace
        )
    )
# Plotly adds a legend entry for every trace, even when the names are identical,
# so the legend is switched off for all edges except the first one.
if lines:
    lines[0].showlegend = True # Show legend for the first edge only as representative
    for line_trace in lines[1:]:
        line_trace.showlegend = False


# Calculate sphere center and radius using the PCA-reduced coordinates
sphere_center, sphere_radius = calculate_sphere_center_radius(p1_coords, p2_coords, p3_coords, q_coords)

# Initialize data list for the figure
data_traces = [scatter] + lines

# Add sphere to the plot if calculation was successful
sphere_trace = None
if sphere_center is not None and sphere_radius is not None:
    print(f"Sphere Center (PCA coords): {sphere_center}")
    print(f"Sphere Radius (PCA coords): {sphere_radius}")
    sphere_trace = get_plotly_sphere_surface(
        sphere_center,
        sphere_radius,
        color='rgba(100, 180, 255, 0.3)', # Light blue, semi-transparent
        resolution=40, # Lower resolution for faster rendering, increase for smoother sphere
        name='Circumsphere P1-P2-P3-Q?'
    )
    data_traces.append(sphere_trace)
else:
    print("Could not calculate sphere parameters. Sphere will not be plotted.")


# Set the title and axis labels
layout = go.Layout(
    title="3D PCA: Data Points, Tetrahedron (P1-P2-P3-Q?), and Circumsphere",
    scene=dict(
        xaxis_title="PC 1",
        yaxis_title="PC 2",
        zaxis_title="PC 3",
        aspectmode='data' # 'data', 'cube', 'auto', 'manual'
                         # 'data' ensures that the scaling of axes matches the data range
                         # 'cube' makes the plot region a cube
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    showlegend=True
)

# Create the figure
fig = go.Figure(data=data_traces, layout=layout)

# Show the plot
fig.show()
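
The circumsphere calculation in the cell above can be sanity-checked on its own. The sketch below solves the same linear system and then confirms that all four points lie at the same distance from the computed center. The function name `circumsphere` and the test points are hypothetical and chosen only for illustration; they are not the PCA coordinates from the notebook.

```python
import numpy as np

def circumsphere(points):
    """Center and radius of the sphere through four non-coplanar 3D points."""
    pts = np.asarray(points, dtype=float)
    # Each point contributes one equation: A*x + B*y + C*z + D = -(x^2 + y^2 + z^2)
    M = np.hstack([pts, np.ones((4, 1))])
    rhs = -np.sum(pts**2, axis=1)
    A, B, C, D = np.linalg.solve(M, rhs)
    center = -0.5 * np.array([A, B, C])
    radius = np.sqrt(center @ center - D)
    return center, radius

# Hypothetical test points: the corners of a unit right tetrahedron.
demo_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
center, radius = circumsphere(demo_points)

# Every defining point should sit exactly one radius away from the center.
dists = np.linalg.norm(np.asarray(demo_points, dtype=float) - center, axis=1)
print("center:", center)
print("radius:", radius)
print("all points equidistant:", np.allclose(dists, radius))
```

The same equidistance check could be applied to `p1_coords`, `p2_coords`, `p3_coords`, and `q_coords` after `calculate_sphere_center_radius` returns.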

### 8.2 Well-Trained 5Vs: CL 1wk + DP/BT>=1month

Below is a typical perception after one month or more of training. The correct answer falls directly into the prompt/question subspace: in the table, Bv has the same 5V vector as Q?.

| 5Vs | Pa | P1 | P2 | P3 | Q? | Ax | Bv | Cx | Dx |
|-----|----|----|----|----|----|----|----|----|----|
| V1  | 0  | 1  | 0  | 0  | 1  | 0  | 1  | 0  | 1  |
| V2  | 1  | 0  | 1  | 1  | 0  | 0  | 0  | 0  | 0  |
| V3  | 1  | 0  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
| V4  | 0  | 1  | 0  | 0  | 1  | 0  | 1  | 1  | 1  |
| V5  | 0  | 1  | 0  | 1  | 0  | 0  | 0  | 0  | 1  |
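
One way to make "falls directly into the question subspace" concrete is to count how many of the five V-features differ between Q? and each answer choice. The sketch below uses the Hamming distance for this. That metric is an assumption made here for illustration, not necessarily the measure intended by the notebook, and the helper name `hamming` is hypothetical. The vectors are copied from the table above.

```python
# 5V vectors copied from the table above, in the order (V1, V2, V3, V4, V5).
question = (1, 0, 1, 1, 0)  # Q?
choices = {
    "Ax": (0, 0, 1, 0, 0),
    "Bv": (1, 0, 1, 1, 0),
    "Cx": (0, 0, 1, 1, 0),
    "Dx": (1, 0, 1, 1, 1),
}

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

distances = {name: hamming(question, vec) for name, vec in choices.items()}
closest = min(distances, key=distances.get)

print(distances)                  # {'Ax': 2, 'Bv': 0, 'Cx': 1, 'Dx': 1}
print("closest to Q?:", closest)  # closest to Q?: Bv
```

A distance of zero means the two points are identical in 5V space, so they also share one location (and overlap) in the PCA plot.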

In [None]:
# Scatterplot and P-Tet: 5V-fin
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.decomposition import PCA

# Your data as a string (including header row)
data_string = """datapoint,V1,V2,V3,V4,V5
Pa,0,1,1,0,1
P1,1,0,0,1,1
P2,0,1,1,0,0
P3,0,1,1,0,1
Q?,1,0,1,1,0
Ax,0,0,1,0,0
Bv,1,0,1,1,0
Cx,0,0,1,1,0
Dx,1,0,1,1,1
"""

# Load the data from the string into a Pandas DataFrame
from io import StringIO
df = pd.read_csv(StringIO(data_string))

# Extract the data for PCA (exclude the 'datapoint' column)
X = df.iloc[:, 1:].values  # Get values from columns V1 to V5

# Extract the datapoint labels
labels = df['datapoint'].tolist()

# Perform PCA
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(X)

# Get the indices of P1, P2, P3, and Q?
p1_index = labels.index('P1')
p2_index = labels.index('P2')
p3_index = labels.index('P3')
q_index = labels.index('Q?')

# Extract the 3D coordinates of P1, P2, P3, and Q?
p1_coords = reduced_data[p1_index]
p2_coords = reduced_data[p2_index]
p3_coords = reduced_data[p3_index]
q_coords = reduced_data[q_index]

# Define the vertices of the tetrahedron
tetrahedron_vertices = np.array([p1_coords, p2_coords, p3_coords, q_coords])

# Define the edges of the tetrahedron (indices of vertices)
tetrahedron_edges = [
    (0, 1), (0, 2), (0, 3),  # Edges from P1 to P2, P3, Q?
    (1, 2), (1, 3),          # Edges from P2 to P3, Q?
    (2, 3)                   # Edge from P3 to Q?
]

# Create the 3D scatter plot for all points
scatter = go.Scatter3d(
    x=reduced_data[:, 0],
    y=reduced_data[:, 1],
    z=reduced_data[:, 2],
    mode='markers+text',
    text=labels,
    textposition="middle right",
    marker=dict(size=4),
    name='Data Points'  # Add a name for the scatter plot
)

# Create the lines for the tetrahedron edges
lines = []
for i, (start_index, end_index) in enumerate(tetrahedron_edges):
    start_point = tetrahedron_vertices[start_index]
    end_point = tetrahedron_vertices[end_index]
    lines.append(
        go.Scatter3d(
            x=[start_point[0], end_point[0]],
            y=[start_point[1], end_point[1]],
            z=[start_point[2], end_point[2]],
            mode='lines',
            line=dict(color='red', width=4),  # Style the lines
            name=f'Edge {i+1}' # Add a name for each line.
        )
    )

# Combine scatter and lines
data = [scatter] + lines

# Set the title and axis labels
layout = go.Layout(
    title="3D PCA of Data with Tetrahedron",
    scene=dict(
        xaxis_title="PC 1",
        yaxis_title="PC 2",
        zaxis_title="PC 3",
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    showlegend=True #show legend
)

# Create the figure
fig = go.Figure(data=data, layout=layout)

# Show the plot (this will display an interactive plot in Colab)
fig.show()
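
Because the plot keeps only three of the five dimensions, it can help to check how much of the variance those three components retain. The sketch below computes the explained-variance ratios with a plain NumPy SVD of the centered data, which should agree with what scikit-learn reports in `pca.explained_variance_ratio_`. The matrix is copied from `data_string` in the cell above; the variable names are chosen only for this illustration.

```python
import numpy as np

# 5V vectors copied from data_string (columns V1..V5).
X = np.array([
    [0, 1, 1, 0, 1],  # Pa
    [1, 0, 0, 1, 1],  # P1
    [0, 1, 1, 0, 0],  # P2
    [0, 1, 1, 0, 1],  # P3
    [1, 0, 1, 1, 0],  # Q?
    [0, 0, 1, 0, 0],  # Ax
    [1, 0, 1, 1, 0],  # Bv
    [0, 0, 1, 1, 0],  # Cx
    [1, 0, 1, 1, 1],  # Dx
], dtype=float)

# PCA centers each column, then takes the singular value decomposition.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first

# Each component's share of the total variance is s_i^2 / sum(s^2).
ratio = s**2 / np.sum(s**2)
retained = ratio[:3].sum()

print("explained variance ratio per component:", np.round(ratio, 3))
print(f"variance retained by the first 3 components: {retained:.1%}")
```

If the retained share is well below 100%, distances in the 3D plot only approximate distances in the original 5V space.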


In [None]:
# Scatterplot, P-Tet, and P-sphere: 5V-fin
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.decomposition import PCA
from io import StringIO

# Your new data as a string (including header row)
data_string = """datapoint,V1,V2,V3,V4,V5
Pa,0,1,1,0,1
P1,1,0,0,1,1
P2,0,1,1,0,0
P3,0,1,1,0,1
Q?,1,0,1,1,0
Ax,0,0,1,0,0
Bv,1,0,1,1,0
Cx,0,0,1,1,0
Dx,1,0,1,1,1
"""

# Load the data from the string into a Pandas DataFrame
df = pd.read_csv(StringIO(data_string))

# Extract the data for PCA (exclude the 'datapoint' column)
X = df.iloc[:, 1:].values  # Get values from columns V1 to V5

# Extract the datapoint labels
labels = df['datapoint'].tolist()

# Perform PCA
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(X)

# Get the indices of P1, P2, P3, and Q?
p1_index = labels.index('P1')
p2_index = labels.index('P2')
p3_index = labels.index('P3')
q_index = labels.index('Q?')

# Extract the 3D coordinates of P1, P2, P3, and Q?
p1_coords = reduced_data[p1_index]
p2_coords = reduced_data[p2_index]
p3_coords = reduced_data[p3_index]
q_coords = reduced_data[q_index]

# Define the vertices of the tetrahedron (also the points defining the sphere)
sphere_defining_points = [p1_coords, p2_coords, p3_coords, q_coords]
tetrahedron_vertices = np.array(sphere_defining_points)


# --- Functions to calculate sphere center and radius ---
def get_sphere_coeffs(p1, p2, p3, p4):
    """
    Calculates the coefficients A, B, C, D of the sphere equation:
    x^2 + y^2 + z^2 + Ax + By + Cz + D = 0
    passing through four points p1, p2, p3, p4.
    """
    points = np.array([p1, p2, p3, p4])
    M = np.ones((4, 4))
    M[:, 0] = points[:, 0]  # x coordinates
    M[:, 1] = points[:, 1]  # y coordinates
    M[:, 2] = points[:, 2]  # z coordinates
    rhs = -(points[:, 0]**2 + points[:, 1]**2 + points[:, 2]**2)
    try:
        coeffs = np.linalg.solve(M, rhs)
        return coeffs
    except np.linalg.LinAlgError:
        print("Error: The four points might be coplanar. Cannot determine a unique sphere.")
        return None

def calculate_sphere_center_radius(p1, p2, p3, p4):
    """
    Calculates the center and radius of a sphere passing through four points.
    Returns (center, radius) or (None, None) if points are coplanar.
    """
    coeffs = get_sphere_coeffs(np.array(p1), np.array(p2), np.array(p3), np.array(p4))
    if coeffs is None:
        return None, None
    A, B, C, D_coeff = coeffs
    xc = -A / 2
    yc = -B / 2
    zc = -C / 2
    center = np.array([xc, yc, zc])
    radius_sq = xc**2 + yc**2 + zc**2 - D_coeff
    # Any exact solution of the linear system satisfies |p - center|^2 = radius_sq >= 0
    # for each of the four points, so a negative value can only come from round-off.
    if radius_sq < -1e-9:  # Clearly negative, beyond floating-point tolerance
        print(f"Error: Radius squared is negative ({radius_sq}). Cannot form sphere.")
        return None, None
    if radius_sq < 0:  # Slightly negative: clamp to zero (the sphere degenerates to a point)
        print(f"Warning: Radius squared is slightly negative ({radius_sq}); clamping to zero.")
        radius_sq = 0.0

    radius = np.sqrt(radius_sq)
    return center, radius

# --- Function to generate Plotly sphere surface ---
def get_plotly_sphere_surface(center, radius, color='rgba(0,180,255,0.3)', resolution=50, name='Circumsphere'):
    """
    Generates Plotly go.Surface data for a sphere.
    """
    theta = np.linspace(0, 2 * np.pi, resolution)
    phi = np.linspace(0, np.pi, resolution)
    theta, phi = np.meshgrid(theta, phi)
    x = center[0] + radius * np.cos(theta) * np.sin(phi)
    y = center[1] + radius * np.sin(theta) * np.sin(phi)
    z = center[2] + radius * np.cos(phi)

    try:
        rgb_color_part = color.split('(')[1].split(')')[0].split(',')
        r, g, b = rgb_color_part[0], rgb_color_part[1], rgb_color_part[2]
        plotly_color = f'rgb({r},{g},{b})'
        opacity_val = float(rgb_color_part[3]) if len(rgb_color_part) > 3 else 0.3
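    except (AttributeError, IndexError, ValueError):
        # Fall back to the default when the string is not in 'rgba(R,G,B,A)' form
        plotly_color = 'rgb(0,180,255)' # Default blue
        opacity_val = 0.3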

    return go.Surface(
        x=x, y=y, z=z,
        colorscale=[[0, plotly_color], [1, plotly_color]],
        showscale=False,
        opacity=opacity_val,
        name=name,
        hoverinfo='skip'
    )

# --- Plotting ---

# Create the 3D scatter plot for all points
scatter = go.Scatter3d(
    x=reduced_data[:, 0],
    y=reduced_data[:, 1],
    z=reduced_data[:, 2],
    mode='markers+text',
    text=labels,
    textposition="middle right",
    marker=dict(size=4, color='black'), # Black markers stay visible against the sphere
    name='Data Points'
)

# Define the edges of the tetrahedron (indices of vertices in tetrahedron_vertices)
tetrahedron_edges_indices = [
    (0, 1), (0, 2), (0, 3),  # Edges from P1
    (1, 2), (1, 3),          # Edges from P2
    (2, 3)                   # Edge from P3
]

# Create the lines for the tetrahedron edges
lines = []
for i, (start_idx, end_idx) in enumerate(tetrahedron_edges_indices):
    start_point = tetrahedron_vertices[start_idx]
    end_point = tetrahedron_vertices[end_idx]
    trace = go.Scatter3d(
        x=[start_point[0], end_point[0]],
        y=[start_point[1], end_point[1]],
        z=[start_point[2], end_point[2]],
        mode='lines',
        line=dict(color='magenta', width=3), # Magenta edges contrast with the blue sphere
        name='Tetrahedron Edge'
    )
    # Show a legend entry only for the first edge trace to avoid clutter
    trace.showlegend = (i == 0)
    lines.append(trace)

# Calculate sphere center and radius using the PCA-reduced coordinates
sphere_center, sphere_radius = calculate_sphere_center_radius(p1_coords, p2_coords, p3_coords, q_coords)

# Initialize data list for the figure
data_traces = [scatter] + lines

# Add sphere to the plot if calculation was successful
sphere_trace = None
if sphere_center is not None and sphere_radius is not None:
    if np.isnan(sphere_center).any() or np.isnan(sphere_radius):
        print("Sphere calculation resulted in NaN values. Sphere will not be plotted.")
    elif sphere_radius < 1e-6:  # Radius is effectively zero
        print(f"Sphere radius is very small ({sphere_radius}) around center {sphere_center}. Skipping the sphere surface.")
        # Optionally, plot a marker for the center if radius is too small
        # data_traces.append(go.Scatter3d(x=[sphere_center[0]], y=[sphere_center[1]], z=[sphere_center[2]],
        #                                  mode='markers', marker=dict(size=5, color='white'), name='SCtr'))
    else:
        print(f"Sphere Center (PCA coords): {sphere_center}")
        print(f"Sphere Radius (PCA coords): {sphere_radius}")
        sphere_trace = get_plotly_sphere_surface(
            sphere_center,
            sphere_radius,
            color='rgba(100, 180, 255, 0.3)', # blueish, semi-transparent
            resolution=40,
            name='Circumsphere (P1-P2-P3-Q?)'
        )
        data_traces.append(sphere_trace)
else:
    print("Could not calculate sphere parameters (e.g., coplanar points). Sphere will not be plotted.")


# Set the title and axis labels
layout = go.Layout(
    title="3D PCA: Data Points, Tetrahedron, and Circumsphere",
    scene=dict(
        xaxis_title="PC 1",
        yaxis_title="PC 2",
        zaxis_title="PC 3",
        aspectmode='data' # Options: 'data', 'cube', 'auto', 'manual'
    ),
    margin=dict(l=0, r=0, b=0, t=0),
    showlegend=True
)

# Create the figure
fig = go.Figure(data=data_traces, layout=layout)

# Show the plot
fig.show()
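
The coplanarity guard above relies on `np.linalg.solve` raising `LinAlgError`, which typically happens only when the matrix is exactly singular. Nearly coplanar points can still pass and produce a very large sphere. A possible extra check, sketched below, compares the tetrahedron's volume with the cube of its longest edge from the first vertex. The function names and the `rel_tol` default are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def tetrahedron_volume(p1, p2, p3, p4):
    """Volume of the tetrahedron spanned by four 3D points (zero if coplanar)."""
    p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p3, p4))
    edges = np.array([p2 - p1, p3 - p1, p4 - p1])
    # |det| of the three edge vectors is six times the tetrahedron volume.
    return abs(np.linalg.det(edges)) / 6.0

def is_nearly_coplanar(p1, p2, p3, p4, rel_tol=1e-6):
    """True when the volume is negligible relative to the tetrahedron's size.

    rel_tol is a hypothetical threshold; tune it for the data at hand.
    """
    p1a = np.asarray(p1, dtype=float)
    others = np.asarray([p2, p3, p4], dtype=float)
    scale = np.linalg.norm(others - p1a, axis=1).max()
    if scale == 0.0:  # All four points coincide
        return True
    return tetrahedron_volume(p1, p2, p3, p4) <= rel_tol * scale**3

# Hypothetical examples: a proper tetrahedron and four points in the plane z = 0.
solid = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
flat = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]

print(tetrahedron_volume(*solid), is_nearly_coplanar(*solid))
print(tetrahedron_volume(*flat), is_nearly_coplanar(*flat))
```

Calling such a check on `p1_coords`, `p2_coords`, `p3_coords`, and `q_coords` before computing the sphere would flag degenerate configurations that the exception handler alone might miss.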