Index:
1. LaTeX to text

    (a) Preprocessing LaTeX code
    
    (b) Converting LaTeX to text

2. Word Frequency Estimation

    (a) Word Frequency Estimation using Counter class from collections library

    (b) Word Frequency Estimation from scratch

3. Exporting data to CSV file
4. Stemming
5. Tagging
6. English frequencies from Wikipedia



# 1. LaTeX to text

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Preprocessing LaTeX code

In [None]:
import re

In [None]:
latex_path = "/content/drive/MyDrive/SPIDER/docs/Paper.tex"

with open(latex_path, 'r') as file:
    content = file.read()

In [None]:
# comments: drop everything from an unescaped % to the end of the line
content = re.sub(r'(?<!\\)%.*', '', content)

# tables (re.DOTALL lets .*? span multiple lines)
content = re.sub(r'\\begin{table}.*?\\end{table}', '', content, flags=re.DOTALL)

# figures
content = re.sub(r'\\begin{figure}.*?\\end{figure}', '', content, flags=re.DOTALL)

# starred figures: \begin{figure*} ... \end{figure*}
content = re.sub(r'\\begin{figure\*}.*?\\end{figure\*}', '', content, flags=re.DOTALL)

# equations
content = re.sub(r'\\begin{equation}.*?\\end{equation}', '', content, flags=re.DOTALL)

# ACM metadata block: \begin{CCSXML} ... \end{CCSXML}
content = re.sub(r'\\begin{CCSXML}.*?\\end{CCSXML}', '', content, flags=re.DOTALL)

# cross-references and citations: ~\ref{...} and ~\cite{...}
content = re.sub(r'~\\ref{.*?}', '', content)
content = re.sub(r'~\\cite{.*?}', '', content)

# print(content) # filtered latex code will be printed here
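The environment-stripping calls above all follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: the name `strip_environments` and the `sample` string are assumptions made for illustration. It also handles the starred variant of each environment through a backreference.

```python
import re

def strip_environments(latex, names):
    """Remove \\begin{name}...\\end{name} blocks, including starred variants."""
    for name in names:
        # (\*?) captures an optional star; \1 requires the same star in \end{...}
        pattern = (r'\\begin\{' + re.escape(name) + r'(\*?)\}'
                   r'.*?'
                   r'\\end\{' + re.escape(name) + r'\1\}')
        latex = re.sub(pattern, '', latex, flags=re.DOTALL)
    return latex

# Hypothetical sample, not taken from the paper
sample = (
    "Intro text.\n"
    "\\begin{figure*}\n\\includegraphics{plot.png}\n\\end{figure*}\n"
    "Middle text.\n"
    "\\begin{table}\nA & B\n\\end{table}\n"
    "Done."
)

cleaned = strip_environments(sample, ["table", "figure", "equation"])
print(cleaned)
```

Because the pattern is non-greedy, each `\begin` is paired with the nearest matching `\end`, which is the same behaviour as the individual `re.sub` calls in the cell above.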

### Converting LaTeX to text

In [None]:
!pip install pylatexenc



In [None]:
from pylatexenc.latex2text import LatexNodes2Text

In [None]:
content = LatexNodes2Text().latex_to_text(content) # LaTeX to text conversion

# collapse runs of blank (or whitespace-only) lines into a single newline
content = re.sub(r'(\n\s*){2,}', r'\n', content)

print(content)


copyrightspace
arpe@itu.dk
IT University of Copenhagen
  Rued Langgaards Vej 7
  Copenhagen
  Denmark
  2300
luai@itu.dk
IT University of Copenhagen
  Rued Langgaards Vej 7
  Copenhagen
  Denmark
  2300
Public discourse on critical issues such as climate change is progressively shifting to social media platforms that prioritize short-form video content. To improve our understanding of this transition, we studied the video content produced by 21 prominent YouTube creators who have expanded their influence to TikTok as information disseminators. Using dictionary-based tools and BERT-based embeddings, we analyzed the transcripts of nearly 7k climate-related videos across both platforms and the 574k comments they received. We found that, when using TikTok, creators use a more emotionally resonant, self-referential, and action-oriented language compared to YouTube. We also observed a strong semantic alignment between videos and comments, with creators who excel at diversifying their TikTok
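The blank-line clean-up in the conversion cell can be checked in isolation. A small sketch, assuming a made-up `sample` string (not taken from the paper), shows that the pattern `(\n\s*){2,}` only fires when at least two newlines occur in a run of whitespace, so a single line break followed by indentation is left alone:

```python
import re

# Hypothetical sample: three blank-ish lines, then an indented line
sample = "line one\n\n\n   \nline two\n  line three"

collapsed = re.sub(r'(\n\s*){2,}', r'\n', sample)
print(repr(collapsed))
```

This is consistent with the output above, where the indented address lines keep their leading spaces.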

In [None]:
# path to output file
text_path = 'modified_file.txt'

with open(text_path, 'w') as file:
    file.write(content)

# 2. Word Frequency Estimation

### (a) Word Frequency Estimation using Counter class from collections library

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
from nltk.tokenize import word_tokenize as wt
from collections import Counter
import string
import pandas as pd
import itertools

In [None]:
def calc_word_freq(text):
    # removing punctuation (: . , ! ? ( ) - ... )
    # note: this also merges hyphenated words, e.g. "short-form" becomes "shortform"
    text = text.translate(str.maketrans('', '', string.punctuation))

    words = wt(text)

    freqs = Counter(words)

    return freqs
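Deleting punctuation before tokenizing glues hyphenated words and e-mail addresses together, and counts are case-sensitive. The sketch below is one possible variant, not the notebook's method: `calc_word_freq_simple` is a hypothetical name, it replaces punctuation with spaces, lowercases the text, and uses `str.split` in place of NLTK's tokenizer so that it runs with the standard library only.

```python
import string
from collections import Counter

def calc_word_freq_simple(text, lowercase=True):
    """Count words after replacing punctuation with spaces (stdlib-only sketch)."""
    if lowercase:
        text = text.lower()
    # Map every punctuation character to a space instead of deleting it,
    # so "short-form" becomes "short form" rather than "shortform"
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).split()
    return Counter(words)

freqs = calc_word_freq_simple(
    "To improve short-form video, we studied the video content. The END."
)
print(freqs.most_common(3))
```

Whether hyphenated compounds should count as one word or two depends on the downstream use, so this is a design choice rather than a correction.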

In [None]:
freqs_1 = Counter()  # running totals across all lines

with open(text_path) as file:
  for text in file:  # iterate over the file line by line
    # Counter.update adds counts; dict.update would overwrite them
    # with the counts from the most recent line
    freqs_1.update(calc_word_freq(text))

data = dict(freqs_1)  # a Counter can be converted to other formats (e.g. dict, list)

print(data)

print(f"\nTotal words: {len(data)}\n")

{'copyrightspace': 1, 'arpeitudk': 1, 'IT': 1, 'University': 1, 'of': 4, 'Copenhagen': 1, 'Rued': 1, 'Langgaards': 1, 'Vej': 1, '7': 1, 'Denmark': 1, '2300': 1, 'luaiitudk': 1, 'Public': 1, 'discourse': 1, 'on': 1, 'critical': 1, 'issues': 1, 'such': 1, 'as': 2, 'climate': 2, 'change': 1, 'is': 1, 'progressively': 1, 'shifting': 1, 'to': 1, 'social': 1, 'media': 1, 'platforms': 1, 'that': 1, 'prioritize': 1, 'shortform': 1, 'video': 1, 'content': 2, 'To': 1, 'improve': 1, 'our': 1, 'understanding': 1, 'this': 2, 'transition': 1, 'we': 1, 'studied': 1, 'the': 5, 'produced': 1, 'by': 1, '21': 1, 'prominent': 1, 'YouTube': 1, 'creators': 3, 'who': 1, 'have': 1, 'expanded': 1, 'their': 1, 'influence': 1, 'TikTok': 3, 'information': 2, 'disseminators': 1, 'Using': 1, 'dictionarybased': 1, 'tools': 1, 'and': 5, 'BERTbased': 1, 'embeddings': 1, 'analyzed': 1, 'transcripts': 2, 'nearly': 1, '7k': 1, 'climaterelated': 2, 'videos': 2, 'across': 1, 'both': 1, '574k': 1, 'comments': 2, 'they': 1, 
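When per-line counts are merged into one dictionary, the choice of container matters: `dict.update` replaces the value stored for a key, whereas `Counter.update` adds to it. A minimal sketch with made-up per-line counts (the numbers are illustrative, not from the paper) shows the difference, which may be relevant when comparing the counts printed above with those in part (b):

```python
from collections import Counter

# Hypothetical counts for two lines of text
line_counts = [
    Counter({"of": 2, "the": 1}),
    Counter({"of": 1, "video": 3}),
]

merged_dict = {}
merged_counter = Counter()

for counts in line_counts:
    merged_dict.update(counts)     # overwrites: keeps only the latest count per word
    merged_counter.update(counts)  # accumulates: adds the counts together

print(merged_dict)
print(dict(merged_counter))
```

Both containers end up with the same set of keys, so comparing only the number of unique words does not reveal the difference.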

### (b) Word Frequency Estimation from scratch

This part is adapted from Abdul Basit's code and updated to read the file line by line.

In [None]:
def count_frequency(path):
    dictionary = {}

    with open(path, "r") as file:

      line = file.readline()

      while line:
        # removing punctuation (: . , ! ? ( ) - ... )
        text = line.translate(str.maketrans('', '', string.punctuation))

        tokens = wt(text)

        for token in tokens:
          if token in dictionary:
            dictionary[token] = dictionary[token] + 1
          else:
            dictionary[token] = 1

        line = file.readline()

    return dictionary

In [None]:
freqs_2 = count_frequency(text_path)

print(freqs_2)
print(f"\nTotal words: {len(freqs_2)}\n")

{'copyrightspace': 1, 'arpeitudk': 1, 'IT': 2, 'University': 2, 'of': 100, 'Copenhagen': 4, 'Rued': 2, 'Langgaards': 2, 'Vej': 2, '7': 2, 'Denmark': 2, '2300': 2, 'luaiitudk': 1, 'Public': 1, 'discourse': 10, 'on': 55, 'critical': 3, 'issues': 1, 'such': 6, 'as': 17, 'climate': 27, 'change': 20, 'is': 11, 'progressively': 1, 'shifting': 1, 'to': 72, 'social': 4, 'media': 2, 'platforms': 29, 'that': 21, 'prioritize': 1, 'shortform': 1, 'video': 27, 'content': 47, 'To': 11, 'improve': 2, 'our': 9, 'understanding': 3, 'this': 10, 'transition': 1, 'we': 30, 'studied': 2, 'the': 150, 'produced': 3, 'by': 11, '21': 4, 'prominent': 2, 'YouTube': 17, 'creators': 30, 'who': 6, 'have': 7, 'expanded': 1, 'their': 22, 'influence': 2, 'TikTok': 19, 'information': 3, 'disseminators': 2, 'Using': 1, 'dictionarybased': 1, 'tools': 3, 'and': 100, 'BERTbased': 1, 'embeddings': 5, 'analyzed': 2, 'transcripts': 9, 'nearly': 1, '7k': 1, 'climaterelated': 4, 'videos': 29, 'across': 9, 'both': 9, '574k': 1, 

In [None]:
len(freqs_1), len(freqs_2) # both approaches yield the same number of unique words

(955, 955)
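Equal lengths only show that both approaches found the same number of unique words. To check that the counts themselves agree, the two mappings can be compared key by key. A minimal sketch, where `diff_counts` is a hypothetical helper and the sample dictionaries are made up:

```python
def diff_counts(a, b):
    """Return {word: (count_in_a, count_in_b)} for words whose counts differ."""
    return {
        word: (a.get(word, 0), b.get(word, 0))
        for word in set(a) | set(b)
        if a.get(word, 0) != b.get(word, 0)
    }

# Hypothetical inputs: same keys, one differing count
first = {"of": 4, "the": 5}
second = {"of": 100, "the": 5}

differences = diff_counts(first, second)
print(differences)
```

An empty result means the two frequency tables are identical; `freqs_1 == freqs_2` gives the same yes/no answer without listing the differing words.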

# 3. Exporting data to CSV file

In [None]:
# Convert the dictionary to a DataFrame
df = pd.DataFrame(list(freqs_1.items()), columns=['Word', 'Frequency'])

df

,Word,Frequency
0,copyrightspace,1
1,arpeitudk,1
2,IT,2
3,University,2
4,of,100
...,...,...
950,performs,1
951,crossplatform,1
952,delves,1
953,relationship,1


In [None]:
csv_path = "EngFreqs.csv"

df.to_csv(csv_path, index=False)
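The export above keeps words in order of first appearance. If a frequency-sorted file is preferred, the rows can be ordered before writing. The sketch below is a pandas-free alternative using only the standard library, with made-up counts and an in-memory buffer standing in for the output file; it is not the notebook's own export path.

```python
import csv
import io
from collections import Counter

# Hypothetical counts for illustration
freqs = Counter({"of": 100, "climate": 27, "the": 150})

buffer = io.StringIO()  # stands in for open(csv_path, "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["Word", "Frequency"])
writer.writerows(freqs.most_common())  # (word, count) pairs, highest count first

csv_text = buffer.getvalue()
print(csv_text)
```

With pandas, the same ordering can be obtained by calling `df.sort_values("Frequency", ascending=False)` before `to_csv`.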

# 4. Stemming

In [None]:
from nltk.stem import PorterStemmer

In [None]:
ps = PorterStemmer()

def apply_stem(token):
  return ps.stem(token)

In [None]:
tokens = wt(content)
print(tokens)



In [None]:
stemmed_tokens = list(map(apply_stem, tokens))

s_df = pd.DataFrame({"original_tokens": tokens, "stemmed_tokens": stemmed_tokens})
s_df

,original_tokens,stemmed_tokens
0,copyrightspace,copyrightspac
1,arpe,arp
2,@,@
3,itu.dk,itu.dk
4,IT,it
...,...,...
3276,content,content
3277,and,and
3278,reactions,reaction
3279,.,.


# 5. Tagging

In [None]:
pos_tags = nltk.pos_tag(tokens)

print(pos_tags)



In [None]:
# let's filter the tags to keep only nouns
# NLTK uses the following (Penn Treebank) tags for nouns
# NN: Noun, singular or mass
# NNS: Noun, plural
# NNP: Proper noun, singular
# NNPS: Proper noun, plural
NOUN_TAGS = ["NN", "NNS", "NNP", "NNPS"]

def filter_tags(token):
  return token[1] in NOUN_TAGS # True if the (token, tag) pair carries a noun tag

In [None]:
filtered_tags = list(filter(filter_tags, pos_tags))
print(filtered_tags)



In [None]:
# # function to get tokens and tags separately
# full_forms = {
#   "NN": "Noun, singular or mass",
#   "NNS": "Noun, plural",
#   "NNP": "Proper noun, singular",
#   "NNPS": "Proper noun, plural",
# }

# def get_ingredients(token):
#   return token[0], full_forms[token[1]] # full form just so that everyone understands what it is

In [None]:
# ingredients = list(map(get_ingredients, filtered_tags)) # returns map of Tuple(Token, Tag)

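# Unpack tokens and tags into separate tuples; distinct names avoid
# overwriting the `tokens` and `tags` variables defined in earlier cells
noun_tokens, noun_tags = zip(*filtered_tags)

final_df = pd.DataFrame({"Token": noun_tokens, "Tag": noun_tags})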
final_df

,Token,Tag
0,copyrightspace,NN
1,arpe,NN
2,@,NNP
3,itu.dk,NN
4,IT,NNP
...,...,...
1078,YouTube,NNP
1079,relationship,NN
1080,content,NN
1081,reactions,NNS
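The noun filter and the `zip(*...)` unpacking can be tried without running the tagger. In the sketch below the `(token, tag)` pairs are written by hand for illustration and are not actual `nltk.pos_tag` output; a set is used for the tag lookup, and a list comprehension replaces `filter`.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# Hand-written (token, tag) pairs, assumed for illustration only
pos_tags = [
    ("creators", "NNS"),
    ("use", "VBP"),
    ("TikTok", "NNP"),
    ("and", "CC"),
    ("language", "NN"),
]

nouns = [(token, tag) for token, tag in pos_tags if tag in NOUN_TAGS]

# zip(*pairs) transposes a list of pairs into two tuples
noun_tokens, noun_tags = zip(*nouns)

print(noun_tokens)
print(noun_tags)
```

Note that the unpacking raises a `ValueError` if no nouns are found, because `zip()` of an empty list yields nothing to unpack.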


In [None]:
# Suggestion:
# We could perform Frequency Estimation first to get unique tokens
# and then we can perform Stemming and Tagging on them to reduce time required for code execution
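The suggestion above can be sketched as follows: count the tokens first, call the stemmer once per distinct token, and then fold the counts onto the stems. The function name `stem_unique` is hypothetical, and `toy_stem` is a deliberately crude placeholder used so that the example runs without NLTK; in the notebook, `PorterStemmer().stem` would be passed in instead.

```python
from collections import Counter

def stem_unique(tokens, stem):
    """Stem each distinct token once and aggregate counts per stem."""
    counts = Counter(tokens)
    stem_of = {token: stem(token) for token in counts}  # one call per unique token
    stem_counts = Counter()
    for token, n in counts.items():
        stem_counts[stem_of[token]] += n
    return stem_of, stem_counts

def toy_stem(token):
    # Placeholder stemmer (assumption): lowercase and strip trailing "s"
    return token.lower().rstrip("s")

tokens = ["reactions", "reaction", "content", "Content", "videos", "reactions"]
stem_of, stem_counts = stem_unique(tokens, toy_stem)

print(stem_of)
print(stem_counts)
```

The saving grows with repetition: the stemmer runs once per unique token rather than once per occurrence. Part-of-speech tagging is a different case, since a tagger uses the surrounding context, so tagging isolated unique tokens may give different tags than tagging running text.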

# 6. English frequencies from Wikipedia

In [None]:
!pip install wikipedia



In [None]:
import wikipedia

In [None]:
wiki_obj = wikipedia.search("BERT: Transformer")
wiki_obj

['BERT (language model)',
 'Transformer (machine learning model)',
 'Vision transformer',
 'Generative pre-trained transformer',
 'Ashish Vaswani',
 'Bert',
 'Tesla coil',
 'GPT-3',
 'Hugging Face',
 'Perceiver']

In [None]:
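def get_page(topic):
  """Return the page for the top Wikipedia search hit, or None if the lookup fails."""
  try:
    results = wikipedia.search(topic)

    if not results:  # nothing found in Wikipedia search
      return None

    return wikipedia.page(results[0])

  except Exception as e:
    # wikipedia.page can raise, e.g. for ambiguous or missing titles
    # print(e)
    return None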

In [None]:
page = get_page("BERT: Transformer")
page.title

'BERT (language model)'

In [None]:
topics = [
  "General intelligence",
  "Reasoning and Problem Solving",
  "Knowledge Representation",
  "Planning",
  "Learning",
  "Natural language processing",
  "Perception",
  "Robotics",
  "AI accelerators",
  "Ambient intelligence"
]

In [None]:
# let's fetch the page content for a topic (empty string if no page is found)
def get_corpus(topic):
  page = get_page(topic)

  if not page:
    return ""

  return page.content

In [None]:
corpus = list(map(get_corpus, topics))
corpus





['An artificial general intelligence (AGI) is a hypothetical type of intelligent agent. If realized, an AGI could learn to accomplish any intellectual task that human beings or animals can perform. Alternatively, AGI has been defined as an autonomous system that surpasses human capabilities in the majority of economically valuable tasks. Creating AGI is a primary goal of some artificial intelligence research and of companies such as OpenAI, DeepMind, and Anthropic. AGI is a common topic in science fiction and futures studies.\nThe timeline for AGI development remains a subject of ongoing debate among researchers and experts. Some argue that it may be possible in years or decades; others maintain it might take a century or longer; and a minority believe it may never be achieved. There is debate on the exact definition of AGI, and regarding whether modern large language models (LLMs) such as GPT-4 are early yet incomplete forms of AGI.\nContention exists over the potential for AGI to pos

In [None]:
# let's save the corpus in a file
corpus = " ".join(corpus) # join list elements to create a whole text

# Wikipedia text contains non-ASCII characters, so set the encoding explicitly
with open("corpus.txt", "w", encoding="utf-8") as file:
  file.write(corpus)
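Because `get_corpus` returns an empty string when no page is found, topics that fail are silently dropped from the joined text. The sketch below keeps track of them. `build_corpus` is a hypothetical helper, and the `fetch` argument with its `fake_pages` lookup is a stand-in for `get_corpus` so that the example runs without network access.

```python
def build_corpus(topics, fetch):
    """Join the text of all topics that resolve; report the ones that do not."""
    texts, missing = [], []
    for topic in topics:
        text = fetch(topic)
        if text:
            texts.append(text)
        else:
            missing.append(topic)
    return " ".join(texts), missing

# Stand-in for Wikipedia lookups (assumed content, for illustration only)
fake_pages = {
    "Planning": "Planning is the process of thinking about future activities.",
    "Robotics": "Robotics is the study of robots.",
}

corpus_text, missing = build_corpus(
    ["Planning", "Ambient intelligence", "Robotics"],
    lambda topic: fake_pages.get(topic, ""),
)

print(missing)
```

In the notebook this would be called as `build_corpus(topics, get_corpus)`, which makes it easy to see which of the ten topics contributed to the frequency counts.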

In [None]:
eng_freqs = calc_word_freq(corpus)
print(eng_freqs)

print(f"\nTotal Words: {len(eng_freqs)}")


Total Words: 7006


# Links

1. LaTeX file: https://drive.google.com/file/d/1Cqb1hEsCSlwjxhlp_g7xBI0dCmOWHn5Q/view?usp=sharing

(The LaTeX file is taken from the paper: https://arxiv.org/abs/2312.04974)