<h1 style='font-size:40px'> Texts Common Origin Analysis</h1>
<div> 
    <ul style='font-size:20px'> 
        <li>
            The present project aims at conceiving a ML system designed for verifying whether two pieces of texts actually came from the same article.
        </li>
        <li>
            The corpus we are dealing with has been extracted from Wikipedia by Professor Jeff Heaton and published at <a href='https://www.kaggle.com/datasets/jeffheaton/are-two-sentences-of-the-same-topic'>Kaggle</a>.
        </li>
    </ul>
 </div>

In [3]:
import pandas as pd
import numpy as np

# Altering pandas' default maximum column width.
pd.set_option('display.max_colwidth', 50)

# Loading the file.
df = pd.read_csv('data/train.csv').drop('id', axis=1)
df.head()

Unnamed: 0,sent1,sent2,same_source
0,"June – Moctezuma II, Aztec ruler of Tenochtitl...",The Swedish regent Sten Sture the Younger is m...,1
1,"The population was 1,097 at the 2010 census.",Like other Latino neighborhoods in New York Ci...,0
2,Europe and the Islamic World: A History.,There are no plans to resurrect it.,0
3,"Even where only a small charge is produced, it...",The Clarion-Limestone Area School District pro...,0
4,The highlight of Croatias recent infrastructur...,The closest analogy with the modern Web browse...,0


<h2 style='font-size:30px'> Dataset Overview</h2>
<div> 
    <ul style='font-size:20px'> 
        <li>
            Now let's explore the texts - especially those from the same articles - so that we can think about a sensible strategy to succeed in this challenge.
        </li>
    </ul>
 </div>

In [4]:
# Note that we are dealing with a well balanced dataset.
df['same_source'].value_counts(normalize=True)

same_source
1    0.500488
0    0.499512
Name: proportion, dtype: float64

In [5]:
# Now, I've considered convenient to create a function that prints out
# instances from each target.
from pyboxen import boxen
from typing import Literal

def boxen_samples(df:pd.DataFrame, target:Literal[0, 1], sample_size:int=5, random_size:int=42)->None:
    '''
        Prints out some pairs of texts from a certain target in a `pyboxen.boxen` format.

        Note: Texts from the positive class are displayed inside green boxes, whereas the ones
        from the negative target in red ones.
        
        Parameters
        ----------
        `df`: `pd.DataFrame`
            The project's dataset.
        `target`: Literal[0, 1]
            The target group from which to make samples.
        `sample_size`: int, defaults to 5
            The amount of instances to print.
        `random_state`: int, defaults to 42
            The sampling random state.
    '''
    color = 'green' if target == 1 else 'red' # Defining the boxen's color according to the wished target.
    df_sample = df[df['same_source']==target].sample(sample_size, random_state=random_size)

    for i, series in df_sample.iterrows():
        print(boxen(f'Sentence 1: {series["sent1"]} \n\nSentence 2: {series["sent2"]}', title=f'Row {i}', color=color))

In [6]:
# Instances from the positive class.
boxen_samples(df, 1)

[32m╭─[0m[32m Row 97148 [0m[32m────────────────────────────────────────────────────────────────────────────────────────────────────[0m[32m─╮[0m
[32m│[0mSentence 1: The racial makeup of the town was 97.56% White, 0.41% African American, 0.18% Native American, 0.41% [32m│[0m
[32m│[0mAsian, 0.28% from other races, and 1.15% from two or more races.                                                 [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0mSentence 2: Originally Plantation Number 9 by the Court of Massachusetts Bay, Huntington has a colorful history, [32m│[0m
[32m│[0mhinted at by the towns incorporation date of March 5, 1855, decades later than the towns around it.              [32m│[0m
[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯[0m



[32m╭─[0m[32m Row 107031 [0m[32m───────────────────────────────────────────────────────────────────────────────────────────────────[0m[32m─╮[0m
[32m│[0m Sentence 1: Camelot is a castle and court associated with the legendary King Arthur.                            [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m Sentence 2: The name of the Romano-British town of Camulodunum (modern Colchester) was derived from the Celtic  [32m│[0m
[32m│[0m god Camulus.                                                                                                    [32m│[0m
[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯[0m



[32m╭─[0m[32m Row 6853 [0m[32m─────────────────────────────────────────────────────────────────────────────────────────────────────[0m[32m─╮[0m
[32m│[0mSentence 1: Celestial historian Richard Allen noted that unlike the other constellations introduced by Plancius  [32m│[0m
[32m│[0mand La Caille, Phoenix has actual precedent in ancient astronomy, as the Arabs saw this formation as representing[32m│[0m
[32m│[0myoung ostriches, Al Riāl, or as a griffin or eagle.                                                              [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0mSentence 2: BD is of spectral type A1V, and ranges between magnitudes 5.90 and 5.94.                             [32m│[0m
[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯[0m



[32m╭─[0m[32m Row 5112 [0m[32m────────────────────────────────────────────────────────────────────────[0m[32m─╮[0m                             
[32m│[0mSentence 1: Hispanic or Latino of any race were 0.81% of the population.            [32m│[0m                             
[32m│[0m                                                                                    [32m│[0m                             
[32m│[0mSentence 2: There were 87 housing units at an average density of 2.4/sqmi (0.9/km²).[32m│[0m                             
[32m╰────────────────────────────────────────────────────────────────────────────────────╯[0m                             



[32m╭─[0m[32m Row 24352 [0m[32m────────────────────────────────────────────────────────────────────────────────────────────────────[0m[32m─╮[0m
[32m│[0m Sentence 1: The population density was 4,519.4 people per square mile (1,621.6/km²).                            [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m Sentence 2: It is bordered to the west by Goose Creek, to the northwest by Barbourmeade, to the north by Manor  [32m│[0m
[32m│[0m Creek, to the east by Ten Broeck, and to the south by a portion of Louisville.                                  [32m│[0m
[32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯[0m



In [7]:
# Instances from the negative class.
boxen_samples(df, 0)

[31m╭─[0m[31m Row 48222 [0m[31m────────────────────────────────────────────────────────────────────────────────────────────────────[0m[31m─╮[0m
[31m│[0mSentence 1: In addition, it is to be noted there was a 1-month period in which another woman served as the Acting[31m│[0m
[31m│[0mPostmaster for Gratz during Ms. Suters tenure, perhaps due to a short leave of absence for Suter and, that acting[31m│[0m
[31m│[0mrole was held by Roberta G Minish from December 21, 1927 until January 18, 1928 when Ms. Suter returned to her   [31m│[0m
[31m│[0mposition as Postmaster.                                                                                          [31m│[0m
[31m│[0m                                                                                                                 [31m│[0m
[31m│[0mSentence 2: At the same time, the Chinese army of Ganzhou reconquers Turpan in Northern Xiongnu.                 [31m│[0m
[31m╰──────────────────────────────────────

[31m╭─[0m[31m Row 90996 [0m[31m───────────────────────────────────────────────────────[0m[31m─╮[0m                                             
[31m│[0mSentence 1: Sid and Nancy (1986), by Alex Cox.                      [31m│[0m                                             
[31m│[0m                                                                    [31m│[0m                                             
[31m│[0mSentence 2: South New Castle is located at  (40.975430, -80.344624).[31m│[0m                                             
[31m╰────────────────────────────────────────────────────────────────────╯[0m                                             



[31m╭─[0m[31m Row 86016 [0m[31m────────────────────────────────────────────[0m[31m─╮[0m                                                        
[31m│[0mSentence 1: Hard fall for man who had it all.            [31m│[0m                                                        
[31m│[0m                                                         [31m│[0m                                                        
[31m│[0mSentence 2: The population was 11,545 at the 2010 census.[31m│[0m                                                        
[31m╰─────────────────────────────────────────────────────────╯[0m                                                        



[31m╭─[0m[31m Row 15535 [0m[31m────────────────────────────────────────────────────────────────────────────────────────────────────[0m[31m─╮[0m
[31m│[0m  Sentence 1: For every 100 females, there were 96.3 males.                                                      [31m│[0m
[31m│[0m                                                                                                                 [31m│[0m
[31m│[0m  Sentence 2: The median income for a household in the CDP was $24,679, and the median income for a family was   [31m│[0m
[31m│[0m  $31,719.                                                                                                       [31m│[0m
[31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯[0m



[31m╭─[0m[31m Row 29459 [0m[31m────────────────────────────────────────────────────────────────────────────────────────────────────[0m[31m─╮[0m
[31m│[0m  Sentence 1: For every 100 females, there were 118.3 males.                                                     [31m│[0m
[31m│[0m                                                                                                                 [31m│[0m
[31m│[0m  Sentence 2: As of the census of 2000, there were 226 people, 96 households, and 67 families residing in the    [31m│[0m
[31m│[0m  township.                                                                                                      [31m│[0m
[31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯[0m



<h3 style='font-size:30px;font-style:italic'> Sample Analysis</h3>
<div> 
    <ul style='font-size:20px'>
        <li> 
            We can observe that most fragments sampled from the same article touches on similar topics (demographics, history), and so tend to resort to typical words from that knowledge area.
        </li>
        <li>
            But we need to be aware that sometimes such documents may use jargons that are very specific to their field ("A1V"). A model might not be able to reckognize such term-theme relation if the word is rare in the corpus.
        </li>
        <li>
            And at last, note that there are some texts that although appear to be of the same subject, they do not belong to the same article! Just look at the last two red boxes that talk about with demographics. Since correctly classifying those cases would be much more intricate even for a human, we need to accept a certain percentage of errors from our system.
        </li>
     </ul>
</div>

<h2 style='font-size:30px'> Strategy Explanation</h2>
<div> 
    <ul style='font-size:20px'> 
        <li>
            Considering that same article fragments tend to discuss similar topics, I think that proceeding a topic modeling would highly suit our mission.
        </li>
        <li>
            We'd do so by running any well-known method (NMF or Latent Dirichlet Allocation) over our bag-of-words matrix. Then, when given a pair of texts, we would compare the topic association scores between them to conclude whether they originated from the same document. 
        </li>
    </ul>
 </div>

<h2 style='font-size:30px'> Data Treatment</h2>
<div> 
    <ul style='font-size:20px'> 
        <li>
            Before attempting any solution, we must turn our corpus into a matrix so that we can apply ML techniques.
        </li>
    </ul>
 </div>

<h3 style='font-size:30px;font-style:italic'> Dataset Split</h3>
<div> 
    <ul style='font-size:20px'>
        <li> 
            In order to legitimize any quality conclusion about our study, we must separate an exclusive dataset for tests.
        </li>
     </ul>
</div>

In [92]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df.iloc[:, -1], train_size=.8, stratify=df.iloc[:, -1], random_state=42)

In [74]:
from statsmodels.stats.proportion import test_proportions_2indep

def _assert_equal_prop(y:pd.Series, alpha:float=.05)->bool:
    '''
        Verifies whether the sets produced by `sklearn.model_selection.train_test_split` have a 
        target distribution similar to the original dataset.

        Parameter
        ---------
        `y`: `pd.Series`
            The target distribution of a given dataset.
        `alpha`: float
            The significance level of our hypothesis test.

        Returns
        -------
        A boolean signaling if the distributions are statistically similar.
    '''
    y_counts = y.value_counts()
    df_counts = df['same_source'].value_counts()
    pvalue = test_proportions_2indep(df_counts[1], df_counts.sum(), y_counts[1], y_counts.sum()).pvalue
    if pvalue<alpha:
        return False
    return True

In [118]:
# The Hypothesis Tests assert that the split successfully maintained similar target distributions
# along the datasets.
try:
    _assert_equal_prop(y_train)
    _assert_equal_prop(y_test)
    print('Distributions Ok')
except:
    print('Distributions are not statistically similar')

Distributions are not statistically similar


<h3 style='font-size:30px;font-style:italic'> Vectorizing the Texts</h3>
<div> 
    <ul style='font-size:20px'>
        <li> 
            To make possible the use of any ML technique over our data, we must turn it into a matrix. In our case, I'll apply an standard TF-IDF.
        </li>
     </ul>
</div>

In [90]:
from nltk.corpus import stopwords
from nltk import download

stop_words = stopwords.words('english')

In [98]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet, stopwords
from typing import List

class LemmaTokenizer:
    '''
        Lemmatizer to be used as the `tokenizer` argument in the 
        `sklearn.feature_extraction.text.TfidfVectorizer` class. It tokenizes  the string and 
        applies lemmatization, according to the WordNet Pos-Tagging.
    '''
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    
    @staticmethod
    def get_wordnet_pos(treebank_tag):
        '''
            Converts a Tree Bank TAG into a WordNet TAG.

            Parameter
            ---------
            `treebank_tag`: str
                The Tree Bank TAG

            Returns
            -------
            The converted TAG.`
        '''
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN
        
    def __call__(self, doc)->List[str]:
        tokens = word_tokenize(doc)
        tokens_tags = pos_tag(tokens)
        return [self.wnl.lemmatize(token, pos=self.get_wordnet_pos(pos)) for token, pos in tokens_tags]

In [119]:
# It would be benefitial to leverage the texts from both columns to carry out the topic modeling.
train_texts = np.concatenate((X_train.iloc[:, 0].to_numpy(), X_train.iloc[:, 1].to_numpy()))
train_texts

array(['Retrieved on December 20, 2008.',
       'The population density was 18 people per square mile (7/km²).',
       'In contrast, successors to the illustrative approach, such as Gil Kane, found their work eventually reach an impasse.',
       ...,
       'Louis also had to abandon claims on fiefdoms in Mecklenburg and Pomerania.',
       'Hispanic or Latino of any race were 2.69% of the population.',
       'Their expertise is in the examination of evidence or relevant facts in the case.'],
      dtype=object)

In [122]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import save_npz

# tf_idf = TfidfVectorizer(tokenizer=LemmaTokenizer(), stop_words=stop_words, strip_accents='ascii')
# X_train_lda = tf_idf.fit_transform(train_texts)
save_npz('data/train-lda.npz', X_train_lda)

In [114]:
! git add .
! git commit -m 'Rodar LDA'
! git push

[master 9509509] Fitar o TF-IDF com o conteúdo de ambas as colunas
 4 files changed, 485 insertions(+), 65 deletions(-)
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 24 threads
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 7.20 KiB | 7.20 MiB/s, done.
Total 6 (delta 3), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.[K
To https://github.com/felipesveiga/texts-same-article.git
   c54b2aa..9509509  master -> master


<p style='color:red'> Expliquei estratégia do projeto; Vetorizei set de treino; Rodar LDA</p>