# Postings generation
This notebook presents the preprocessing needed to create the postings for each token in the jokes dataset.

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords') # for common English words

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\limal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data structure
The [Short Jokes](https://www.kaggle.com/datasets/abhinavmoudgil95/short-jokes) dataset contains over 200,000 short jokes on different topics.

In [2]:
data = pd.read_csv('shortjokes.csv')
ids = data['ID']
jokes = data.values[:,1]
data.head()

| | ID | Joke |
|---|---|---|
| 0 | 1 | [me narrating a documentary about narrators] "... |
| 1 | 2 | Telling my daughter garlic is good for you. Go... |
| 2 | 3 | I've been going through a really rough period ... |
| 3 | 4 | If I could have dinner with anyone, dead or al... |
| 4 | 5 | Two guys walk into a bar. The third guy ducks. |


## Data preprocessing
A method is defined to preprocess input text. Given a set of stop words (common English words), it removes symbols, lowercases the text, and drops the stop words, producing a tokenized version of the input.

In [10]:
def tokenize_text(text, stop_words):
    # remove symbols from the input text (the backslash is escaped explicitly)
    symbols = "!\"#$%&()*+-.,'/:;<=>?@[\\]^_`{|}~\n"
    for si in symbols:
        text = text.replace(si, '')

    # lowercase the text
    text = text.lower()

    # split on any whitespace so repeated spaces do not produce empty tokens
    words = text.split()

    # drop stop words
    filtered_words = []
    for wi in words:
        if wi not in stop_words:
            filtered_words.append(wi)

    return filtered_words

# set of stop words
stop_words = set(stopwords.words('english'))
ind = 98 # example index
example = data["Joke"].values[ind] # avoid shadowing the built-in input()
print(f'original: {example}')
print(f'tokenized: {tokenize_text(example, stop_words)}')

original: [uses the restroom] Wife: make sure to put the toilet seat down Me: okay Me: [to toilet seat] you're worthless and nobody likes you
tokenized: ['uses', 'restroom', 'wife', 'make', 'sure', 'put', 'toilet', 'seat', 'okay', 'toilet', 'seat', 'youre', 'worthless', 'nobody', 'likes']
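
The same three steps (strip symbols, lowercase, drop stop words) can also be sketched with `str.translate`, which removes all symbols in a single pass instead of one `replace` call per symbol. This is only an illustrative alternative, not the method used in this notebook: the name `tokenize_text_translate` and the tiny `toy_stop_words` set are assumptions made for the example, and this variant splits on any whitespace.

```python
# Same symbol set as in tokenize_text; the backslash is escaped explicitly.
SYMBOLS = "!\"#$%&()*+-.,'/:;<=>?@[\\]^_`{|}~\n"

# Translation table that deletes every character in SYMBOLS.
STRIP_TABLE = str.maketrans('', '', SYMBOLS)


def tokenize_text_translate(text, stop_words):
    """Hypothetical variant of tokenize_text built on str.translate."""
    cleaned = text.translate(STRIP_TABLE).lower()
    # split() with no argument splits on any run of whitespace
    return [w for w in cleaned.split() if w not in stop_words]


# Toy stop-word set (an assumption, so the example runs without NLTK).
toy_stop_words = {'the', 'to', 'and', 'you'}

tokens = tokenize_text_translate(
    "Two guys walk into a bar. The third guy ducks.", toy_stop_words
)
print(tokens)
# ['two', 'guys', 'walk', 'into', 'a', 'bar', 'third', 'guy', 'ducks']
```

With the full NLTK stop-word set, words such as `into` and `a` would also be dropped.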


## Postings creation
Each joke is processed with the `tokenize_text` method to build the postings dictionary: each key is a token, and its value is the set of indices of the jokes that contain that token. The indices are the zero-based positions produced by `enumerate`, not the values of the `ID` column.

In [11]:
stop_words = set(stopwords.words('english'))

postings = {}
for i, jokei in enumerate(jokes):
    # tokens from jokei
    tokens = tokenize_text(jokei, stop_words)

    # iterate over the tokens
    for ti in tokens:
        if ti not in postings: # first time this token is seen
            postings[ti] = set() # create an empty set for the new token
        postings[ti].add(i) # add the index of the joke

print('Example of keys and values:')
for ki in list(postings.keys())[:10]:
    print(f'{ki}: {postings[ki]}')

Example of keys and values:
narrating: {0, 60930, 142343, 183202, 31422, 84547, 133831, 171472, 157020, 179686, 228712, 94191, 56304, 115826, 42355, 12915, 88311, 229496, 10234, 173307}
documentary: {0, 60930, 187397, 142343, 79372, 182806, 80921, 131610, 171033, 21532, 61986, 186405, 208934, 223783, 2607, 186931, 219703, 40510, 146495, 14915, 189511, 212040, 110155, 16461, 223829, 89176, 55906, 112739, 214631, 10355, 16499, 87681, 149128, 209032, 44181, 224923, 113821, 150686, 164, 196773, 93864, 53938, 40633, 103102, 47299, 82633, 204492, 63182, 199377, 93396, 24794, 146141, 80096, 41193, 114927, 61177, 173307, 176386, 190728, 209677, 184592, 85779, 64276, 175389, 195869, 37151, 83232, 69411, 156451, 146725, 21289, 224045, 174391, 12626, 182100, 165209, 89954, 214883, 198500, 153448, 42355, 141683, 150902, 14226, 138663, 22960, 70582, 204730, 62907, 174016, 210372, 81357, 171472, 164317, 139753, 106986, 138731, 106995, 22004, 47093, 10234, 199676}
narrators: {0, 139844, 164454}
cant:
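
To illustrate how a postings dictionary of this shape might be used, the following minimal sketch answers an AND query by intersecting the sets of the query tokens. The helper `search_all` and the small `toy_postings` dictionary are hypothetical and exist only for this example; they are not part of the notebook or its data.

```python
# Toy postings with the same structure as above: token -> set of joke indices.
toy_postings = {
    'narrating': {0, 5, 9},
    'documentary': {0, 2, 9},
    'narrators': {0},
}


def search_all(query_tokens, postings):
    """Return the indices of the jokes that contain every query token."""
    # A token that never appeared in the collection maps to an empty set.
    sets = [postings.get(t, set()) for t in query_tokens]
    if not sets:
        return set()
    return set.intersection(*sets)


print(sorted(search_all(['narrating', 'documentary'], toy_postings)))  # [0, 9]
```

Query text would normally go through the same `tokenize_text` step first, so that query tokens and postings keys are normalized in the same way.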

Finally, the [postings.txt](postings.txt) file is created with the postings information:

In [9]:
with open('postings.txt', 'w') as t:
    for ki, indices in postings.items():
        # one line per token: "<token> <idx1>,<idx2>,..."
        t.write(f"{ki} {','.join(str(ind) for ind in indices)}\n")
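
For completeness, here is a minimal sketch of reading this format back into a dictionary, assuming one line per token with the token, a single space, and then comma-separated indices. The function name `load_postings` is hypothetical, and the example parses an in-memory sample rather than the real `postings.txt`.

```python
import io


def load_postings(lines):
    """Parse lines of the form '<token> <idx1>,<idx2>,...' into a dict of sets."""
    postings = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        # The token never contains a space, so split at the first one.
        token, _, ids = line.partition(' ')
        postings[token] = {int(i) for i in ids.split(',') if i}
    return postings


# In-memory sample in the assumed format (not the real file contents).
sample = io.StringIO("narrators 0,139844,164454\nbar 4\n")
loaded = load_postings(sample)
print(sorted(loaded['narrators']))  # [0, 139844, 164454]
```

With the real file, the same function could be called as `load_postings(open('postings.txt'))`, ideally inside a `with` block.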
