This notebook computes additional per-sentence features for the macrophage. It adds the number of English words in each sentence, the total number of words in the sentence, and the proportion of English words, where an English word is any word found in a separate, pre-determined dictionary. It also adds four columns giving the fraction of characters in the sentence that are alphabetic, numeric, punctuation, or other, and one column giving the proportion of words found in the top-500 word list.

For this notebook, you will need:
*   CombinedDictionary.txt (combination of 75k MainDictionary.txt words, 50k last names, all girl names, all boy names)
*   The tsv file from Macrophage Stage 1
*   Top500words.txt (first 500 words of the MainDictionary.txt)


The output of this notebook will be a tsv file with the following headers:


*   Filename
*   previous two sentences
*   sentence (the sentence we are interested in)
*   next two sentences
*   whether or not fasttext determined the sentence was English (T/F)
*   the English probability for the sentence
*   The non-English language with the highest fasttext probability
*   The probability for that non-English language (will outperform English if the sentence is False for English)
*   Center-distance
*   dict_words
*   all_words
*   dict_proportion
*   frac_alphabetic
*   frac_numeric
*   frac_punctuation
*   frac_other
*   T500_proportion



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import re
import string

In [3]:
# This function uses built-in Python functions to count the number of alphabetic characters,
# numeric characters, punctuation characters, and anything that doesn't fit into those three categories.
# string.punctuation covers ASCII punctuation only, so whitespace and non-ASCII marks (e.g. curly
# quotes) are counted as "other". It returns the fraction of the sentence in each category.

def calculate_fractions(sentence):
    total_chars = len(sentence)
    if total_chars == 0:
        return 0, 0, 0, 0

    alphabet_count = sum(c.isalpha() for c in sentence)
    number_count = sum(c.isdigit() for c in sentence)
    punct_count = sum(c in string.punctuation for c in sentence)
    other_count = total_chars - (alphabet_count + number_count + punct_count)

    return (alphabet_count / total_chars,
            number_count / total_chars,
            punct_count / total_chars,
            other_count / total_chars)
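
To make the "other" category concrete, here is a small check on two inputs. The function body is repeated so the example runs on its own. The first string is a made-up toy input; the second is assumed to match the sentence shown in row 3 of the sample output further down (four ASCII marks, one curly quote, four spaces).

```python
import string

def calculate_fractions(sentence):
    """Same logic as the notebook's function, repeated so this example runs alone."""
    total_chars = len(sentence)
    if total_chars == 0:
        return 0, 0, 0, 0
    alphabet_count = sum(c.isalpha() for c in sentence)
    number_count = sum(c.isdigit() for c in sentence)
    punct_count = sum(c in string.punctuation for c in sentence)
    other_count = total_chars - (alphabet_count + number_count + punct_count)
    return (alphabet_count / total_chars, number_count / total_chars,
            punct_count / total_chars, other_count / total_chars)

# Toy input: 2 letters, 1 space, 1 digit, 1 ASCII punctuation mark.
print(calculate_fractions("ab 1!"))  # (0.4, 0.2, 0.2, 0.2)

# The curly quote and the four spaces all land in "other".
alpha, num, punct, other = calculate_fractions(". ’ ' , .")
print(round(punct, 6), round(other, 6))  # 0.444444 0.555556
```

Because spaces count as "other", ordinary prose has a nonzero `frac_other`, which is worth keeping in mind when choosing a threshold for it.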

In [4]:
# This cell loads the combined dictionary (75k English words + 50k last names + all girl names + all
# boy names) and defines the function behind three new columns in the tsv originally created with
# fasttext: the number of words found in the dictionary, the total number of words, and the
# proportion of dictionary words to all words.

with open('/content/CombinedDictionary.txt', 'r', encoding='utf-8') as file:
    dictionary_words = set(word.strip().lower() for word in file)

# Function to remove every character that is not a letter, digit, underscore, or whitespace from a
# token (leading, trailing, or internal) and lowercase it, so that it can be matched to a word in
# the dictionary
def clean_word(word):
    return re.sub(r'[^\w\s]', '', word).lower()

# This uses the clean_word function to clean each whitespace-separated token, then adds up the number
# of tokens that match a word in the dictionary. Tokens that are empty after cleaning still count
# toward all_words. An empty sentence gets a proportion of 0 instead of raising ZeroDivisionError.
def calculate_dict_metrics(sentence):
    words = sentence.split()
    cleaned_words = [clean_word(word) for word in words]
    dict_words = sum(1 for word in cleaned_words if word in dictionary_words)
    all_words = len(cleaned_words)
    proportion = dict_words / all_words if all_words > 0 else 0
    return dict_words, all_words, proportion
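
As a quick illustration of how tokens are cleaned and counted, here is a minimal sketch using a tiny made-up dictionary in place of CombinedDictionary.txt. The names `toy_dictionary` and `dict_metrics` are hypothetical and exist only for this example; the cleaning regex is the one used above.

```python
import re

# Hypothetical stand-in for CombinedDictionary.txt; the real file is far larger.
toy_dictionary = {"soda", "water", "illinois", "brewing", "co"}

def clean_word(word):
    # Strips every non-word, non-space character, wherever it occurs in the token.
    return re.sub(r'[^\w\s]', '', word).lower()

def dict_metrics(sentence, dictionary):
    # Same counting logic as calculate_dict_metrics, with the dictionary passed in.
    cleaned = [clean_word(w) for w in sentence.split()]
    hits = sum(1 for w in cleaned if w in dictionary)
    total = len(cleaned)
    return hits, total, (hits / total if total else 0)

# "[Soda" and "Co." match after cleaning; "&" becomes an empty string,
# which is not in the dictionary but still counts toward the total.
hits, total, proportion = dict_metrics("[Soda water Illinois Brewing & Co.", toy_dictionary)
print(hits, total)  # 5 6

# An empty sentence yields zeros rather than a division error.
print(dict_metrics("", toy_dictionary))  # (0, 0, 0)
```

Tokens made up entirely of punctuation therefore lower the proportion, which is the desired behavior for spotting OCR noise.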


In [5]:
# This cell loads the top 500 REAL ENGLISH WORDS and defines a function that returns the proportion
# of words from the sentence that are in this real English set to all words in the sentence.

with open('/content/Top500words.txt', 'r', encoding='utf-8') as file:
    dictionary_words2 = set(word.strip().lower() for word in file)

# clean_word is already defined in the previous cell and is reused here.

# This uses the clean_word function to clean each word, then adds up the number of words in the sentence
# that match a word in the top-500 set. Then calculates the proportion of words in the sentence that are
# found in that set; an empty sentence gets a proportion of 0.
def calculate_500dict_metrics(sentence):
    words = sentence.split()
    cleaned_words = [clean_word(word) for word in words]
    dict_words = sum(1 for word in cleaned_words if word in dictionary_words2)
    proportion = dict_words / len(cleaned_words) if len(cleaned_words) > 0 else 0
    return proportion


In [6]:
# Actually running the dictionary and fraction functions

df = pd.read_csv('/content/drive/MyDrive/UIUC_Summer2024/RA_Underwood/GPT1914/all_hathi_sents_macro1.tsv', sep='\t')

# First applying the dictionary function
df[['dict_words', 'all_words', 'dict_proportion']] = df['sent'].apply(lambda x: pd.Series(calculate_dict_metrics(x)))

# Then applying the alph/num/punct/other fractions function
df[['frac_alphabetic', 'frac_numeric', 'frac_punctuation', 'frac_other']] = df['sent'].apply(
    lambda x: pd.Series(calculate_fractions(x))
)

# Finally applying the T500_proportion function
df['T500_proportion'] = df['sent'].apply(calculate_500dict_metrics)

print("Added columns for CombinedDictionary, fraction functions, and Top500 words dictionary.")

# Master TSV file with everything
df.to_csv('/content/drive/MyDrive/UIUC_Summer2024/RA_Underwood/GPT1914/all_hathi_sents_macro2.tsv', sep='\t', index=False)

print("Created Master File.")

Added columns for CombinedDictionary, fraction functions, and Top500 words dictionary.
Created Master File.
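
A note on dtypes: wrapping a mixed `(int, int, float)` tuple in `pd.Series` upcasts all three values to float, which is presumably why `dict_words` and `all_words` display as `17.0` and `21.0` in the output below. The sketch below shows one possible alternative that keeps the counts as integers. `toy_metrics` is a hypothetical stand-in for `calculate_dict_metrics`, and the two-row frame is made up.

```python
import pandas as pd

def toy_metrics(sentence):
    # Hypothetical stand-in returning (int, int, float), like calculate_dict_metrics.
    words = sentence.split()
    n = len(words)
    hits = sum(1 for w in words if w.islower())
    return hits, n, (hits / n if n else 0.0)

toy = pd.DataFrame({"sent": ["one two Three", "four Five"]})
cols = ["dict_words", "all_words", "dict_proportion"]

# Approach used in the cell above: each tuple becomes a float Series.
via_series = toy["sent"].apply(lambda s: pd.Series(toy_metrics(s)))
via_series.columns = cols

# Alternative: build the frame from a list of tuples so each column keeps its own dtype.
via_tuples = pd.DataFrame(toy["sent"].map(toy_metrics).tolist(),
                          index=toy.index, columns=cols)

print(via_series.dtypes.tolist())
print(via_tuples.dtypes.tolist())
```

The second form also avoids constructing one Series per row, which tends to be slow on large files.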


In [13]:
df.head(5)

Unnamed: 0,file,prev_sent,sent,next_sent,is_english,english_prob,non_english_lang,non_english_prob,center_dist,dict_words,all_words,dict_proportion,frac_alphabetic,frac_numeric,frac_punctuation,frac_other,T500_proportion
0,coo.31924055997609.norm.txt,,Fortune Bros. Brewing Co Fortune Bros Brewin...,Gottfried Brewirvg Great Western \ ine Co na...,True,0.321631,__label__de,0.182048,0.5,17.0,21.0,0.809524,0.759124,0.0,0.043796,0.19708,0.142857
1,coo.31924055997609.norm.txt,Fortune Bros. Brewing Co Fortune Bros Brewing...,Gottfried Brewirvg Great Western \ ine Co na...,(og American Malting Coi and Bliss st iglaltin...,True,0.383198,__label__de,0.087019,0.499979,6.0,10.0,0.6,0.783333,0.0,0.033333,0.183333,0.2
2,coo.31924055997609.norm.txt,Fortune Bros. Brewing Co Fortune Bros Brewing...,(og American Malting Coi and Bliss st iglaltinfz.,". ’ ' , . 1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110...",True,0.567239,__label__tr,0.032386,0.499957,6.0,8.0,0.75,0.816327,0.0,0.040816,0.142857,0.125
3,coo.31924055997609.norm.txt,Gottfried Brewirvg Great Western \ ine Co na...,". ’ ' , .",1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110 ‘\l.e1’eâ...,False,0.000127,__label__diq,0.201745,0.499936,0.0,5.0,0.0,0.0,0.0,0.444444,0.555556,0.0
4,coo.31924055997609.norm.txt,(og American Malting Coi and Bliss st iglaltin...,1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110 ‘\l.e1’eâ...,". . , Z301 Wallace st.. . [Soda water Illinoi...",False,0.020663,__label__no,0.239044,0.499914,4.0,15.0,0.266667,0.511905,0.130952,0.095238,0.261905,0.0


Now that we have our master file created, we can cut it down a bit and filter it. The most helpful columns for our training will be:

*   Filename
*   previous two sentences
*   sentence (the sentence we are interested in)
*   next two sentences
*   whether or not fasttext determined the sentence was English (T/F)
*   the English probability for the sentence
*   The probability for the highest-scoring non-English language (will exceed the English probability if the sentence is False for English)
*   Center_distance
*   dict_proportion
*   frac_other
*   T500_proportion

And the helpful filters for our training data will be:
*   low English probability (determined by fasttext) OR
*   low dict_proportion (the proportion of words in the CombinedDictionary.txt) OR
*   high frac_other (fraction of characters that aren't alphabetic, numeric, or ASCII punctuation; whitespace counts as other)
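
The three conditions are combined with OR, so a sentence is flagged if it trips any one of them. A minimal sketch on a made-up three-row frame, using the thresholds from the filtering cell later in this notebook (0.55, 0.50, 0.60); the sentences and scores here are invented for illustration.

```python
import pandas as pd

# Invented rows: one clean sentence, one with low scores, one with many odd characters.
toy = pd.DataFrame({
    "sent": ["clean English sentence", "garbled ocr txet", "§§ ¶¶ ‡‡"],
    "english_prob": [0.95, 0.40, 0.70],
    "dict_proportion": [1.00, 0.33, 0.60],
    "frac_other": [0.10, 0.12, 0.75],
})

# A row is kept when ANY of the three conditions holds.
mask = (
    (toy["english_prob"] < 0.55)
    | (toy["dict_proportion"] < 0.50)
    | (toy["frac_other"] > 0.60)
)
flagged = toy[mask]

print(flagged["sent"].tolist())  # ['garbled ocr txet', '§§ ¶¶ ‡‡']
```

The third row passes the first two tests and is caught only by `frac_other`, which is why all three signals are kept.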

In [8]:
# First, we drop the columns we don't need:

drop_cols = ['non_english_lang', 'dict_words', 'frac_alphabetic', 'frac_numeric', 'frac_punctuation', 'all_words']

df1 = df.drop(columns=drop_cols)

df1.head(5)

Unnamed: 0,file,prev_sent,sent,next_sent,is_english,english_prob,non_english_prob,center_dist,dict_proportion,frac_other,T500_proportion
0,coo.31924055997609.norm.txt,,Fortune Bros. Brewing Co Fortune Bros Brewin...,Gottfried Brewirvg Great Western \ ine Co na...,True,0.321631,0.182048,0.5,0.809524,0.19708,0.142857
1,coo.31924055997609.norm.txt,Fortune Bros. Brewing Co Fortune Bros Brewing...,Gottfried Brewirvg Great Western \ ine Co na...,(og American Malting Coi and Bliss st iglaltin...,True,0.383198,0.087019,0.499979,0.6,0.183333,0.2
2,coo.31924055997609.norm.txt,Fortune Bros. Brewing Co Fortune Bros Brewing...,(og American Malting Coi and Bliss st iglaltinfz.,". ’ ' , . 1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110...",True,0.567239,0.032386,0.499957,0.75,0.142857,0.125
3,coo.31924055997609.norm.txt,Gottfried Brewirvg Great Western \ ine Co na...,". ’ ' , .",1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110 ‘\l.e1’eâ...,False,0.000127,0.201745,0.499936,0.0,0.555556,0.0
4,coo.31924055997609.norm.txt,(og American Malting Coi and Bliss st iglaltin...,1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110 ‘\l.e1’eâ...,". . , Z301 Wallace st.. . [Soda water Illinoi...",False,0.020663,0.239044,0.499914,0.266667,0.261905,0.0


In [9]:
df1.head(1)

Unnamed: 0,file,prev_sent,sent,next_sent,is_english,english_prob,non_english_prob,center_dist,dict_proportion,frac_other,T500_proportion
0,coo.31924055997609.norm.txt,,Fortune Bros. Brewing Co Fortune Bros Brewin...,Gottfried Brewirvg Great Western \ ine Co na...,True,0.321631,0.182048,0.5,0.809524,0.19708,0.142857


In [28]:
# Then, we filter for the criteria we want. A sentence is kept if it has:

#  english_prob of less than 0.55 OR
#  dict_proportion of less than 0.5 OR
#  frac_other of greater than 0.6.
# A second, stricter subset additionally requires sentences of at least 20 characters.


df_filtered = df1[(df1['english_prob'] < 0.55) | (df1['dict_proportion'] < 0.50) | (df1['frac_other'] > 0.60)]

# sample1 is drawn before the length filter, so it can include very short sentences
sample1 = df_filtered.sample(n=90, random_state=1)

df_filtered1 = df_filtered[df_filtered['sent'].str.len() >= 20]

# sample2 is drawn only from sentences of at least 20 characters
sample2 = df_filtered1.sample(n=90, random_state=1)

df_filtered1.head()

Unnamed: 0,file,prev_sent,sent,next_sent,is_english,english_prob,non_english_prob,center_dist,dict_proportion,frac_other,T500_proportion
0,coo.31924055997609.norm.txt,,Fortune Bros. Brewing Co Fortune Bros Brewin...,Gottfried Brewirvg Great Western \ ine Co na...,True,0.321631,0.182048,0.5,0.809524,0.19708,0.142857
1,coo.31924055997609.norm.txt,Fortune Bros. Brewing Co Fortune Bros Brewing...,Gottfried Brewirvg Great Western \ ine Co na...,(og American Malting Coi and Bliss st iglaltin...,True,0.383198,0.087019,0.499979,0.6,0.183333,0.2
4,coo.31924055997609.norm.txt,(og American Malting Coi and Bliss st iglaltin...,1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110 ‘\l.e1’eâ...,". . , Z301 Wallace st.. . [Soda water Illinoi...",False,0.020663,0.239044,0.499914,0.266667,0.261905,0.0
5,coo.31924055997609.norm.txt,". ’ ' , . 1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110...",". . , Z301 Wallace st.. .",[Soda water Illinois Brewing & Маши: Со... i...,True,0.163391,0.08473,0.499893,0.285714,0.24,0.0
6,coo.31924055997609.norm.txt,1 1 av rewe 11339; i§'ÍÈ§'.’ ‚о 110 ‘\l.e1’eâ...,[Soda water Illinois Brewing & Маши: Со... i...,‘Brewery 118 . ’ t |31 w. 01110 м.,True,0.386981,0.031745,0.499872,0.466667,0.168317,0.066667


In [29]:
df_suspicious_sentences = pd.concat([sample1, sample2], ignore_index=True)

df_suspicious_sentences.to_csv('/content/suspicious_sentences.tsv', sep='\t', index=False)
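
Since `sample2` is drawn from a subset of the frame that `sample1` is drawn from, the same sentence can in principle appear in both samples. If duplicates are unwanted, one option is to drop them after concatenating. This is only a sketch on made-up data; the choice of `file` and `sent` as the identifying columns is an assumption.

```python
import pandas as pd

# Made-up pool: three short sentences and three of at least 20 characters.
pool = pd.DataFrame({
    "file": ["a.txt"] * 6,
    "sent": ["short", "tiny", "a sentence of more than twenty characters",
             "another sufficiently long example sentence", "x",
             "yet another long enough sentence here"],
})

sample1 = pool.sample(n=4, random_state=1)          # drawn from everything
long_only = pool[pool["sent"].str.len() >= 20]
sample2 = long_only.sample(n=3, random_state=1)     # all three long rows

combined = pd.concat([sample1, sample2], ignore_index=True)

# Keep the first occurrence of each (file, sent) pair.
deduped = combined.drop_duplicates(subset=["file", "sent"], ignore_index=True)

print(len(combined), len(deduped))
```

With only three short rows in the pool, any four-row sample must contain at least one long row, so the overlap here is guaranteed; on the real data it depends on the draw.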