# Exercise 5: Probabilistic Model

## Objective
- Apply preprocessing techniques step by step, evaluating the impact of each stage on the number of tokens and on the final vocabulary.

## Part 0: Loading the Corpus

In [32]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

## Part 1: Tokenization

### Activity
1. Tokenize the documents.

In [33]:
import pandas as pd

In [53]:
# Build the DataFrame with the column already named 'doc'
newsgroups_df = pd.DataFrame({'doc': newsgroupsdocs})
newsgroups_df

Unnamed: 0,doc
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...
18842,\nNot in isolated ground recepticles (usually ...
18843,I just installed a DX2-66 CPU in a clone mothe...
18844,\nWouldn't this require a hyper-sphere. In 3-...


In [54]:
newsgroups_df['doc']

0        \n\nI am sure some bashers of Pens fans are pr...
1        My brother is in the market for a high-perform...
2        \n\n\n\n\tFinally you said what you dream abou...
3        \nThink!\n\nIt's the SCSI card doing the DMA t...
4        1)    I have an old Jasmine drive which I cann...
                               ...                        
18841    DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...
18842    \nNot in isolated ground recepticles (usually ...
18843    I just installed a DX2-66 CPU in a clone mothe...
18844    \nWouldn't this require a hyper-sphere.  In 3-...
18845    After a tip from Gary Crum (crum@fcom.cc.utah....
Name: doc, Length: 18846, dtype: object

In [55]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/murder/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [56]:
from nltk.tokenize import word_tokenize
newsgroups_df['tokens'] = newsgroups_df['doc'].apply(word_tokenize)
newsgroups_df

Unnamed: 0,doc,tokens
0,\n\nI am sure some bashers of Pens fans are pr...,"[I, am, sure, some, bashers, of, Pens, fans, a..."
1,My brother is in the market for a high-perform...,"[My, brother, is, in, the, market, for, a, hig..."
2,\n\n\n\n\tFinally you said what you dream abou...,"[Finally, you, said, what, you, dream, about, ..."
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,"[Think, !, It, 's, the, SCSI, card, doing, the..."
4,1) I have an old Jasmine drive which I cann...,"[1, ), I, have, an, old, Jasmine, drive, which..."
...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,"[DN, >, From, :, nyeda, @, cnsvax.uwec.edu, (,..."
18842,\nNot in isolated ground recepticles (usually ...,"[Not, in, isolated, ground, recepticles, (, us..."
18843,I just installed a DX2-66 CPU in a clone mothe...,"[I, just, installed, a, DX2-66, CPU, in, a, cl..."
18844,\nWouldn't this require a hyper-sphere. In 3-...,"[Would, n't, this, require, a, hyper-sphere, ...."


In [57]:
newsgroups_df['tokens'][0]

['I',
 'am',
 'sure',
 'some',
 'bashers',
 'of',
 'Pens',
 'fans',
 'are',
 'pretty',
 'confused',
 'about',
 'the',
 'lack',
 'of',
 'any',
 'kind',
 'of',
 'posts',
 'about',
 'the',
 'recent',
 'Pens',
 'massacre',
 'of',
 'the',
 'Devils',
 '.',
 'Actually',
 ',',
 'I',
 'am',
 'bit',
 'puzzled',
 'too',
 'and',
 'a',
 'bit',
 'relieved',
 '.',
 'However',
 ',',
 'I',
 'am',
 'going',
 'to',
 'put',
 'an',
 'end',
 'to',
 'non-PIttsburghers',
 "'",
 'relief',
 'with',
 'a',
 'bit',
 'of',
 'praise',
 'for',
 'the',
 'Pens',
 '.',
 'Man',
 ',',
 'they',
 'are',
 'killing',
 'those',
 'Devils',
 'worse',
 'than',
 'I',
 'thought',
 '.',
 'Jagr',
 'just',
 'showed',
 'you',
 'why',
 'he',
 'is',
 'much',
 'better',
 'than',
 'his',
 'regular',
 'season',
 'stats',
 '.',
 'He',
 'is',
 'also',
 'a',
 'lot',
 'fo',
 'fun',
 'to',
 'watch',
 'in',
 'the',
 'playoffs',
 '.',
 'Bowman',
 'should',
 'let',
 'JAgr',
 'have',
 'a',
 'lot',
 'of',
 'fun',
 'in',
 'the',
 'next',
 'couple',
 '

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(newsgroupsdocs)
tokens = vectorizer.get_feature_names_out()
# get_feature_names_out() returns the unique terms, so this count is the vocabulary size
print(f"Number of tokens: {len(tokens)}")
print(f"Tokens: {tokens}")

Number of tokens: 134410
Tokens: ['00' '000' '0000' ... '³ation' 'ýé' 'ÿhooked']
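
The vocabulary above comes from `TfidfVectorizer`'s own tokenizer rather than from `word_tokenize`, so the two token sets are not directly comparable. By default the vectorizer lowercases the text and keeps only tokens of two or more word characters. The sketch below approximates that default with the standard `re` module. `sklearn_like_tokenize` is a hypothetical helper name, and the real analyzer supports further options (accent stripping, n-grams) that are not reproduced here.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers:
# two or more word characters; punctuation acts as a separator.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def sklearn_like_tokenize(text):
    """Approximate the default analyzer: lowercase, then regex matching."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

sample = "Think! It's the SCSI card doing the DMA"
result = sklearn_like_tokenize(sample)
print(result)
# ['think', 'it', 'the', 'scsi', 'card', 'doing', 'the', 'dma']
```

Note that the one-letter fragment `s` from `It's` is discarded, which `word_tokenize` would instead keep as the token `'s`.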


## Part 2: Normalization

### Activity
1. Convert all tokens to lowercase.
2. Remove punctuation and non-alphabetic symbols.
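
Both steps can be combined into a single pass over a token list. This is a minimal sketch, assuming the tokens are plain Python strings. `normalize` is a hypothetical helper name and the sample list is made up for illustration.

```python
def normalize(tokens):
    """Lowercase tokens and keep only the purely alphabetic ones."""
    return [t.lower() for t in tokens if t.isalpha()]

sample_tokens = ["Think", "!", "It", "'s", "the", "SCSI", "DX2-66", "non-PIttsburghers"]
normalized = normalize(sample_tokens)
print(normalized)
# ['think', 'it', 'the', 'scsi']
```

Because `str.isalpha()` rejects any token containing a hyphen, digit, or apostrophe, hyphenated words such as `non-PIttsburghers` are dropped entirely rather than split.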

In [59]:
# Remove punctuation and non-alphabetic symbols
tokens = [word for word in newsgroups_df['tokens'][0] if word.isalpha()]
print(f"Number of tokens: {len(tokens)}")
print(f"Tokens: {tokens}")

Number of tokens: 136
Tokens: ['I', 'am', 'sure', 'some', 'bashers', 'of', 'Pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'Pens', 'massacre', 'of', 'the', 'Devils', 'Actually', 'I', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', 'However', 'I', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'Pens', 'Man', 'they', 'are', 'killing', 'those', 'Devils', 'worse', 'than', 'I', 'thought', 'Jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', 'He', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs', 'Bowman', 'should', 'let', 'JAgr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'Pens', 'are', 'going', 'to', 'beat', 'the', 'pulp', 'out', 'of', 'Jersey', 'anyway', 'I', 'was', 'very', 'disappointed', 'not

In [61]:
words = [word.lower() for word in tokens]
print(f"Number of tokens: {len(words)}")
print(f"Tokens: {words}")

Number of tokens: 136
Tokens: ['i', 'am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils', 'actually', 'i', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', 'however', 'i', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'pens', 'man', 'they', 'are', 'killing', 'those', 'devils', 'worse', 'than', 'i', 'thought', 'jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', 'he', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs', 'bowman', 'should', 'let', 'jagr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'pens', 'are', 'going', 'to', 'beat', 'the', 'pulp', 'out', 'of', 'jersey', 'anyway', 'i', 'was', 'very', 'disappointed', 'not

In [62]:
from nltk import regexp_tokenize
tokens = regexp_tokenize(str(newsgroups_df['doc'][0]), pattern=r'\w+')
print(f"Number of tokens: {len(tokens)}")
print(f"Tokens: {tokens}")

Number of tokens: 138
Tokens: ['I', 'am', 'sure', 'some', 'bashers', 'of', 'Pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'Pens', 'massacre', 'of', 'the', 'Devils', 'Actually', 'I', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', 'However', 'I', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'non', 'PIttsburghers', 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'Pens', 'Man', 'they', 'are', 'killing', 'those', 'Devils', 'worse', 'than', 'I', 'thought', 'Jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', 'He', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs', 'Bowman', 'should', 'let', 'JAgr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'Pens', 'are', 'going', 'to', 'beat', 'the', 'pulp', 'out', 'of', 'Jersey', 'anyway', 'I', 'was', 'ver

In [64]:
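# Lowercase each document, then tokenize with a regex.
# Note: \w[a-z]+ needs at least two characters, so one-letter words ('i', 'a')
# are dropped, and the leading \w also admits a digit or underscore.
newsgroups_df['regex_tokens'] = newsgroups_df['doc'].str.lower().apply(regexp_tokenize, pattern=r'\w[a-z]+')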
newsgroups_df

Unnamed: 0,doc,tokens,regex_tokens
0,\n\nI am sure some bashers of Pens fans are pr...,"[I, am, sure, some, bashers, of, Pens, fans, a...","[am, sure, some, bashers, of, pens, fans, are,..."
1,My brother is in the market for a high-perform...,"[My, brother, is, in, the, market, for, a, hig...","[my, brother, is, in, the, market, for, high, ..."
2,\n\n\n\n\tFinally you said what you dream abou...,"[Finally, you, said, what, you, dream, about, ...","[finally, you, said, what, you, dream, about, ..."
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,"[Think, !, It, 's, the, SCSI, card, doing, the...","[think, it, the, scsi, card, doing, the, dma, ..."
4,1) I have an old Jasmine drive which I cann...,"[1, ), I, have, an, old, Jasmine, drive, which...","[have, an, old, jasmine, drive, which, cannot,..."
...,...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,"[DN, >, From, :, nyeda, @, cnsvax.uwec.edu, (,...","[dn, from, nyeda, cnsvax, uwec, edu, david, ny..."
18842,\nNot in isolated ground recepticles (usually ...,"[Not, in, isolated, ground, recepticles, (, us...","[not, in, isolated, ground, recepticles, usual..."
18843,I just installed a DX2-66 CPU in a clone mothe...,"[I, just, installed, a, DX2-66, CPU, in, a, cl...","[just, installed, dx, cpu, in, clone, motherbo..."
18844,\nWouldn't this require a hyper-sphere. In 3-...,"[Would, n't, this, require, a, hyper-sphere, ....","[wouldn, this, require, hyper, sphere, in, spa..."
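
The pattern `\w[a-z]+` explains why tokens such as `i` and `a` are missing from `regex_tokens`. The sketch below reproduces the matching with the standard `re` module on a made-up sentence, assuming that `re.findall` behaves like `regexp_tokenize` for a pattern without capture groups. The letters-only pattern `[a-z]+` is shown only as a possible alternative, not as what the notebook used.

```python
import re

text = "i installed a dx2-66 cpu in 3d"

# Pattern used in the notebook: one word character followed by one or more letters.
notebook_tokens = re.findall(r"\w[a-z]+", text)
print(notebook_tokens)
# ['installed', 'dx', 'cpu', 'in', '3d']

# Letters-only alternative that also keeps one-letter words.
letters_only_tokens = re.findall(r"[a-z]+", text)
print(letters_only_tokens)
# ['i', 'installed', 'a', 'dx', 'cpu', 'in', 'd']
```

The first pattern keeps `3d` because `\w` accepts the leading digit. The second never emits digits but does keep single letters.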


In [65]:
for _, row in newsgroups_df.iterrows():
    print(row)

doc             \n\nI am sure some bashers of Pens fans are pr...
tokens          [I, am, sure, some, bashers, of, Pens, fans, a...
regex_tokens    [am, sure, some, bashers, of, pens, fans, are,...
Name: 0, dtype: object
doc             My brother is in the market for a high-perform...
tokens          [My, brother, is, in, the, market, for, a, hig...
regex_tokens    [my, brother, is, in, the, market, for, high, ...
Name: 1, dtype: object
doc             \n\n\n\n\tFinally you said what you dream abou...
tokens          [Finally, you, said, what, you, dream, about, ...
regex_tokens    [finally, you, said, what, you, dream, about, ...
Name: 2, dtype: object
doc             \nThink!\n\nIt's the SCSI card doing the DMA t...
tokens          [Think, !, It, 's, the, SCSI, card, doing, the...
regex_tokens    [think, it, the, scsi, card, doing, the, dma, ...
Name: 3, dtype: object
doc             1)    I have an old Jasmine drive which I cann...
tokens          [1, ), I, have, an, old, Jasmine, 

## Part 3: Stopword Removal

### Activity
1. Remove stopwords using a standard list.

In [None]:
!pip install nltk



In [71]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/murder/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [80]:
from nltk.corpus import stopwords
stw = stopwords.words('english')
stw

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [81]:
from nltk.corpus import stopwords

# Build the stopword set once instead of on every call;
# set membership tests are O(1) on average.
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(tokens):
    """Return the tokens that are not English stopwords."""
    return [w for w in tokens if w not in STOPWORDS]
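
Filtering with a list comprehension matters for correctness, not only for style. A loop that calls `list.remove` once per stopword deletes only the first occurrence of each word and also mutates its input. The sketch below contrasts the two approaches on a made-up token list. `SAMPLE_STOPWORDS` is a hypothetical three-word list, not NLTK's, and both function names are illustrative.

```python
SAMPLE_STOPWORDS = {"the", "of", "a"}  # hypothetical tiny stopword list

def remove_with_list_remove(tokens):
    """Flawed approach: list.remove deletes only the first match per stopword."""
    tokens = list(tokens)  # copy so the caller's list is not mutated
    for w in SAMPLE_STOPWORDS:
        if w in tokens:
            tokens.remove(w)
    return tokens

def remove_with_filter(tokens):
    """Keep every token that is not a stopword, preserving order."""
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]

tokens = ["the", "lack", "of", "posts", "about", "the", "massacre", "of", "the", "devils"]
print(remove_with_list_remove(tokens))
# ['lack', 'posts', 'about', 'the', 'massacre', 'of', 'the', 'devils']
print(remove_with_filter(tokens))
# ['lack', 'posts', 'about', 'massacre', 'devils']
```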

In [82]:
newsgroups_df['sw_tokens'] = newsgroups_df['regex_tokens'].apply(remove_stopwords)
newsgroups_df

Unnamed: 0,doc,tokens,regex_tokens,sw_tokens
0,\n\nI am sure some bashers of Pens fans are pr...,"[I, am, sure, some, bashers, of, Pens, fans, a...","[am, sure, some, bashers, of, pens, fans, are,...","[sure, bashers, pens, fans, pretty, confused, ..."
1,My brother is in the market for a high-perform...,"[My, brother, is, in, the, market, for, a, hig...","[my, brother, is, in, the, market, for, high, ...","[brother, market, high, performance, video, ca..."
2,\n\n\n\n\tFinally you said what you dream abou...,"[Finally, you, said, what, you, dream, about, ...","[finally, you, said, what, you, dream, about, ...","[finally, said, dream, mediterranean, new, are..."
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,"[Think, !, It, 's, the, SCSI, card, doing, the...","[think, it, the, scsi, card, doing, the, dma, ...","[think, scsi, card, dma, transfers, disks, scs..."
4,1) I have an old Jasmine drive which I cann...,"[1, ), I, have, an, old, Jasmine, drive, which...","[have, an, old, jasmine, drive, which, cannot,...","[old, jasmine, drive, cannot, use, new, system..."
...,...,...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,"[DN, >, From, :, nyeda, @, cnsvax.uwec.edu, (,...","[dn, from, nyeda, cnsvax, uwec, edu, david, ny...","[dn, nyeda, cnsvax, uwec, edu, david, nye, dn,..."
18842,\nNot in isolated ground recepticles (usually ...,"[Not, in, isolated, ground, recepticles, (, us...","[not, in, isolated, ground, recepticles, usual...","[isolated, ground, recepticles, usually, unusu..."
18843,I just installed a DX2-66 CPU in a clone mothe...,"[I, just, installed, a, DX2-66, CPU, in, a, cl...","[just, installed, dx, cpu, in, clone, motherbo...","[installed, dx, cpu, clone, motherboard, tried..."
18844,\nWouldn't this require a hyper-sphere. In 3-...,"[Would, n't, this, require, a, hyper-sphere, ....","[wouldn, this, require, hyper, sphere, in, spa...","[require, hyper, sphere, space, points, specif..."


## Part 4: Stemming or Lemmatization

### Activity
1. Apply stemming.
2. Apply lemmatization.
3. Compare both techniques.
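
The cells in this part apply lemmatization only, so the following is a minimal sketch of the stemming step, assuming NLTK's `PorterStemmer` is an acceptable choice of stemmer. `stem_tokens` is a hypothetical helper name and the sample words are taken from the first document's tokens.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer is rule-based and needs no extra data download.
stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Apply the Porter stemmer to each token."""
    return [stemmer.stem(t) for t in tokens]

sample = ["pens", "fans", "posts", "killing", "relieved", "confused"]
for original, stemmed in zip(sample, stem_tokens(sample)):
    print(f"{original} -> {stemmed}")
```

Stemming strips suffixes by rule, so its output need not be a dictionary word, whereas a lemmatizer returns dictionary forms. Applying `stem_tokens` to the `sw_tokens` column would give a result that can be compared side by side with the lemmatized tokens.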

In [83]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/murder/nltk_data...


True

In [84]:
from nltk.stem import WordNetLemmatizer as wnl
wnl().lemmatize('dogs')

'dog'

In [85]:
tok = newsgroups_df.loc[0].doc.split()

In [86]:
[wnl().lemmatize(t) for t in newsgroups_df.loc[0].sw_tokens]

['sure',
 'bashers',
 'pen',
 'fan',
 'pretty',
 'confused',
 'lack',
 'kind',
 'post',
 'recent',
 'pen',
 'massacre',
 'devil',
 'actually',
 'bit',
 'puzzled',
 'bit',
 'relieved',
 'however',
 'going',
 'put',
 'end',
 'non',
 'pittsburghers',
 'relief',
 'bit',
 'praise',
 'pen',
 'man',
 'killing',
 'devil',
 'worse',
 'thought',
 'jagr',
 'showed',
 'much',
 'better',
 'regular',
 'season',
 'stats',
 'also',
 'lot',
 'fo',
 'fun',
 'watch',
 'playoff',
 'bowman',
 'let',
 'jagr',
 'lot',
 'fun',
 'next',
 'couple',
 'game',
 'since',
 'pen',
 'going',
 'beat',
 'pulp',
 'jersey',
 'anyway',
 'disappointed',
 'see',
 'islander',
 'lose',
 'final',
 'regular',
 'season',
 'game',
 'pen',
 'rule']

In [87]:
wnl().lemmatize('bashers')

'bashers'
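
`WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is given, and it returns the word unchanged when no dictionary form is found. That is consistent with `'bashers'` coming back as is, and with `'going'` and `'killing'` surviving in the earlier output. The following is a simplified sketch of that lookup idea, not NLTK's implementation. `LEXICON`, `RULES`, and `toy_lemmatize` are made up for illustration.

```python
# Hypothetical miniature lexicon standing in for WordNet's noun index.
LEXICON = {"dog", "pen", "fan", "devil", "game"}

# A few noun detachment rules (suffix -> replacement), tried in order.
RULES = [("ies", "y"), ("ses", "s"), ("s", "")]

def toy_lemmatize(word):
    """Return a lexicon entry reachable via one rule, else the word unchanged."""
    if word in LEXICON:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word

print(toy_lemmatize("dogs"))     # dog
print(toy_lemmatize("bashers"))  # bashers ("basher" is not in the toy lexicon)
print(toy_lemmatize("killing"))  # killing (no noun rule applies)
```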

In [93]:
' '.join(newsgroups_df.loc[0].sw_tokens)

'sure bashers pens fans pretty confused lack kind posts recent pens massacre devils actually bit puzzled bit relieved however going put end non pittsburghers relief bit praise pens man killing devils worse thought jagr showed much better regular season stats also lot fo fun watch playoffs bowman let jagr lot fun next couple games since pens going beat pulp jersey anyway disappointed see islanders lose final regular season game pens rule'

In [94]:
def lemmatize(tokens):
    # Instantiate the lemmatizer once per call rather than once per token
    lemmatizer = wnl()
    return [lemmatizer.lemmatize(t) for t in tokens]

In [95]:
newsgroups_df['lem_tokens'] = newsgroups_df['sw_tokens'].apply(lemmatize)
newsgroups_df

Unnamed: 0,doc,tokens,regex_tokens,sw_tokens,lem_tokens,prep_doc
0,\n\nI am sure some bashers of Pens fans are pr...,"[I, am, sure, some, bashers, of, Pens, fans, a...","[am, sure, some, bashers, of, pens, fans, are,...","[sure, bashers, pens, fans, pretty, confused, ...","[sure, bashers, pen, fan, pretty, confused, la...",sure bashers pen fan pretty confused lack kind...
1,My brother is in the market for a high-perform...,"[My, brother, is, in, the, market, for, a, hig...","[my, brother, is, in, the, market, for, high, ...","[brother, market, high, performance, video, ca...","[brother, market, high, performance, video, ca...",brother market high performance video card sup...
2,\n\n\n\n\tFinally you said what you dream abou...,"[Finally, you, said, what, you, dream, about, ...","[finally, you, said, what, you, dream, about, ...","[finally, said, dream, mediterranean, new, are...","[finally, said, dream, mediterranean, new, are...",finally said dream mediterranean new area grea...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,"[Think, !, It, 's, the, SCSI, card, doing, the...","[think, it, the, scsi, card, doing, the, dma, ...","[think, scsi, card, dma, transfers, disks, scs...","[think, scsi, card, dma, transfer, disk, scsi,...",think scsi card dma transfer disk scsi card dm...
4,1) I have an old Jasmine drive which I cann...,"[1, ), I, have, an, old, Jasmine, drive, which...","[have, an, old, jasmine, drive, which, cannot,...","[old, jasmine, drive, cannot, use, new, system...","[old, jasmine, drive, cannot, use, new, system...",old jasmine drive cannot use new system unders...
...,...,...,...,...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,"[DN, >, From, :, nyeda, @, cnsvax.uwec.edu, (,...","[dn, from, nyeda, cnsvax, uwec, edu, david, ny...","[dn, nyeda, cnsvax, uwec, edu, david, nye, dn,...","[dn, nyeda, cnsvax, uwec, edu, david, nye, dn,...",dn nyeda cnsvax uwec edu david nye dn neurolog...
18842,\nNot in isolated ground recepticles (usually ...,"[Not, in, isolated, ground, recepticles, (, us...","[not, in, isolated, ground, recepticles, usual...","[isolated, ground, recepticles, usually, unusu...","[isolated, ground, recepticles, usually, unusu...",isolated ground recepticles usually unusual co...
18843,I just installed a DX2-66 CPU in a clone mothe...,"[I, just, installed, a, DX2-66, CPU, in, a, cl...","[just, installed, dx, cpu, in, clone, motherbo...","[installed, dx, cpu, clone, motherboard, tried...","[installed, dx, cpu, clone, motherboard, tried...",installed dx cpu clone motherboard tried mount...
18844,\nWouldn't this require a hyper-sphere. In 3-...,"[Would, n't, this, require, a, hyper-sphere, ....","[wouldn, this, require, hyper, sphere, in, spa...","[require, hyper, sphere, space, points, specif...","[require, hyper, sphere, space, point, specifi...",require hyper sphere space point specifies sph...


In [96]:
newsgroups_df['prep_doc'] = newsgroups_df['lem_tokens'].str.join(' ')
newsgroups_df

Unnamed: 0,doc,tokens,regex_tokens,sw_tokens,lem_tokens,prep_doc
0,\n\nI am sure some bashers of Pens fans are pr...,"[I, am, sure, some, bashers, of, Pens, fans, a...","[am, sure, some, bashers, of, pens, fans, are,...","[sure, bashers, pens, fans, pretty, confused, ...","[sure, bashers, pen, fan, pretty, confused, la...",sure bashers pen fan pretty confused lack kind...
1,My brother is in the market for a high-perform...,"[My, brother, is, in, the, market, for, a, hig...","[my, brother, is, in, the, market, for, high, ...","[brother, market, high, performance, video, ca...","[brother, market, high, performance, video, ca...",brother market high performance video card sup...
2,\n\n\n\n\tFinally you said what you dream abou...,"[Finally, you, said, what, you, dream, about, ...","[finally, you, said, what, you, dream, about, ...","[finally, said, dream, mediterranean, new, are...","[finally, said, dream, mediterranean, new, are...",finally said dream mediterranean new area grea...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,"[Think, !, It, 's, the, SCSI, card, doing, the...","[think, it, the, scsi, card, doing, the, dma, ...","[think, scsi, card, dma, transfers, disks, scs...","[think, scsi, card, dma, transfer, disk, scsi,...",think scsi card dma transfer disk scsi card dm...
4,1) I have an old Jasmine drive which I cann...,"[1, ), I, have, an, old, Jasmine, drive, which...","[have, an, old, jasmine, drive, which, cannot,...","[old, jasmine, drive, cannot, use, new, system...","[old, jasmine, drive, cannot, use, new, system...",old jasmine drive cannot use new system unders...
...,...,...,...,...,...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,"[DN, >, From, :, nyeda, @, cnsvax.uwec.edu, (,...","[dn, from, nyeda, cnsvax, uwec, edu, david, ny...","[dn, nyeda, cnsvax, uwec, edu, david, nye, dn,...","[dn, nyeda, cnsvax, uwec, edu, david, nye, dn,...",dn nyeda cnsvax uwec edu david nye dn neurolog...
18842,\nNot in isolated ground recepticles (usually ...,"[Not, in, isolated, ground, recepticles, (, us...","[not, in, isolated, ground, recepticles, usual...","[isolated, ground, recepticles, usually, unusu...","[isolated, ground, recepticles, usually, unusu...",isolated ground recepticles usually unusual co...
18843,I just installed a DX2-66 CPU in a clone mothe...,"[I, just, installed, a, DX2-66, CPU, in, a, cl...","[just, installed, dx, cpu, in, clone, motherbo...","[installed, dx, cpu, clone, motherboard, tried...","[installed, dx, cpu, clone, motherboard, tried...",installed dx cpu clone motherboard tried mount...
18844,\nWouldn't this require a hyper-sphere. In 3-...,"[Would, n't, this, require, a, hyper-sphere, ....","[wouldn, this, require, hyper, sphere, in, spa...","[require, hyper, sphere, space, points, specif...","[require, hyper, sphere, space, point, specifi...",require hyper sphere space point specifies sph...
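
To evaluate the impact of each stage, it helps to count total tokens and distinct tokens per stage. This is a minimal sketch, assuming each stage is available as an iterable of token lists, for example a DataFrame column such as `tokens` or `lem_tokens`. `corpus_stats` is a hypothetical helper and the toy lists below are not taken from the corpus.

```python
from collections import Counter

def corpus_stats(tokenized_docs):
    """Return (total token count, vocabulary size) for an iterable of token lists."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return sum(counts.values()), len(counts)

# Toy data standing in for two stages of the same two documents.
raw = [["The", "Pens", "rule", "."], ["the", "Devils", "lose", "."]]
clean = [["pen", "rule"], ["devil", "lose"]]

for name, docs in [("raw", raw), ("clean", clean)]:
    total, vocab = corpus_stats(docs)
    print(f"{name}: {total} tokens, {vocab} unique")
# raw: 8 tokens, 7 unique
# clean: 4 tokens, 4 unique
```

Called on a real column, for instance `corpus_stats(newsgroups_df['regex_tokens'])`, the same function would report the figures for that stage of the corpus.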
