In [2]:
!pip install gensim



In [3]:
text_corpus = [
     
"A male student shot and wounded two faculty members at a Denver high school on Wednesday and then fled the scene, spurring a citywide search for his whereabouts, according to city officials.",

"The shooting at East High School was reported at about 9:50 a.m., and police and medical responders arrived on the scene “very quickly” to find two adult men with gunshot wounds, Denver Police Chief Ron Thomas said.",

"One faculty member is in stable condition, and the other is in critical condition, he said.",

"The student suspected in the shooting, who is under 18, was under a school safety plan in which he was patted down each day, the police chief said. During Wednesday’s search, a handgun was retrieved and several shots were fired in an office area in the front of the school, away from other students and staff, he said.",

"The student then fled the school, and a search is underway for his whereabouts. The weapon has not been recovered, Thomas said.",

"“We are looking for the suspect,” Mayor Michael Hancock said. “We will find that suspect, and we will hold that suspect accountable for his actions this morning in placing everyone in danger and certainly wounding the two staff members who were doing their job and trying to keep everyone safe at the time.”",

"The mayor asked residents to look out for the student, described as an African American juvenile with an afro and wearing a hoodie with an astronaut on it. The student should be considered “armed and dangerous and willing to use the weapon, as we learned this morning,” the mayor said.",

"In addition, another student was taken to a hospital because of an allergic reaction, the mayor said. Paramedics were already in the building for that incident when the shooting occurred and so were able to immediately treat those wounded, Hancock said.",

"The incident is the 18th shooting this year at an elementary or secondary school in the US in which at least one person was injured or killed, according to the Gun Violence Archive, a non-profit that attempts to log every incident of gun violence in the US in real time. Just this week, one student was shot at a high school parking lot in Dallas, and in Arlington, Texas, a student was killed and another was injured in a shooting outside a high school.",

"Wednesday also marks two years since the mass shooting at the King Soopers supermarket in Boulder, Colorado, in which 10 people were killed.",

"East High School has about 2,500 students across 9th through 12th grades and is the largest and highest-performing comprehensive high school of all Denver Public Schools, according to the school system.",

"The high school is located in the City Park neighborhood of the Colorado capital and is considered a Denver Historic Landmark for its architecture in the Jacobethan Revival style. The clock tower atop the school is similar in style to Philadelphia’s Independence Hall, the school website notes.",

"Several hours after the shooting, school officials began implementing a “controlled release” of students, according to a tweet from Denver Public Schools. Students who commuted to school will be escorted to their cars, students who rode the bus will be held on campus until their bus arrives and students who were dropped off by a parent can be picked up from a separate location, the tweet says.",

"The school will be out of session for the rest of the week following the shooting, Superintendent Alex Marrero said during a news conference. When students return and for the remainder of the school year, two armed officers would be present on campus, he said.",

"Colorado Gov. Jared Polis issued a statement saying he was praying for the two victims’ recovery.",

"“Our students should and must be able to attend school without fear for their safety, their parents deserve the peace of mind that their children are safe in classrooms, and teachers should be able to work safely and without harm,” he said."

]

In [4]:
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

In [5]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [6]:
from pprint import pprint  # pretty-printer
from collections import defaultdict

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
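The same two filtering steps can be wrapped in a helper and applied to free-form text such as `text_corpus`. The sketch below is an illustration, not part of the original notebook: the name `preprocess`, the `min_count` parameter, and the regex tokenizer are assumptions. The regex drops punctuation, which plain `str.split()` does not do, so "said." and "said" count as the same token.

```python
import re
from collections import Counter

STOPLIST = set('for a of the and to in'.split())

def preprocess(docs, stoplist=STOPLIST, min_count=2):
    """Tokenize, drop stop words, then drop tokens seen fewer than min_count times overall."""
    # assumption: any run of ASCII letters/digits is a token, so punctuation is discarded
    tokenized = [
        [tok for tok in re.findall(r"[a-z0-9]+", doc.lower()) if tok not in stoplist]
        for doc in docs
    ]
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized]

# two sentences taken from text_corpus
sample = [
    "One faculty member is in stable condition, and the other is in critical condition, he said.",
    "The weapon has not been recovered, Thomas said.",
]
sample_texts = preprocess(sample)
print(sample_texts)
```

With only two sentences almost every token is unique, so very little survives the frequency filter; on the full `text_corpus` more tokens would repeat.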


In [7]:
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save('deerwester.dict')  # store the dictionary, for future reference
print(dictionary)

2023-03-29 19:04:42,912 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2023-03-29 19:04:42,913 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2023-03-29 19:04:42,913 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2023-03-29T19:04:42.913066', 'gensim': '4.3.1', 'python': '3.10.4 | packaged by conda-forge | (main, Mar 30 2022, 08:38:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19044-SP0', 'event': 'created'}
2023-03-29 19:04:42,914 : INFO : Dictionary lifecycle event {'fname_or_handle': 'deerwester.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2023-03-29T19:04:42.914067', 'gensim': '4.3.1', 'python': '3.10.4 | packaged by conda-forge | (main, Mar 30 2022, 08:38

Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>


In [8]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


In [9]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

[(0, 1), (1, 1)]
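To make the bag-of-words output easier to read, here is a plain-Python sketch that reproduces the observable result of `doc2bow` for this dictionary. It is not gensim's implementation; `bow_sketch` is a hypothetical helper, and the `token2id` mapping is copied from the printed output above.

```python
from collections import Counter

# mapping copied from the dictionary.token2id output above
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

def bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from token2id."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

vec_new = bow_sketch("Human computer interaction".lower().split(), token2id)
vec_eps = bow_sketch("System and human system engineering testing of EPS".lower().split(), token2id)
print(vec_new)
print(vec_eps)
```

Each pair reads as (token id, number of occurrences in the document). Tokens missing from the mapping, such as "interaction", simply contribute nothing.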


In [11]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('deerwester.mm', corpus)  # store to disk, for later use
print(corpus)

2023-03-29 19:06:48,513 : INFO : storing corpus in Matrix Market format to deerwester.mm
2023-03-29 19:06:48,514 : INFO : saving sparse matrix to deerwester.mm
2023-03-29 19:06:48,514 : INFO : PROGRESS: saving document #0
2023-03-29 19:06:48,515 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2023-03-29 19:06:48,516 : INFO : saving MmCorpus index to deerwester.mm.index


[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
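The "saved 9x12 matrix, density=25.926% (28/108)" log line can be checked by hand: density is the number of stored (id, count) pairs divided by documents times vocabulary size. A small sketch, with the corpus copied from the printed output above and the variable names chosen for illustration:

```python
# bag-of-words corpus copied from the printed output above
corpus_bow = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]
num_terms = 12  # size of the dictionary

num_docs = len(corpus_bow)
num_nonzero = sum(len(doc) for doc in corpus_bow)  # one stored entry per (id, count) pair
num_cells = num_docs * num_terms
density = 100.0 * num_nonzero / num_cells
summary = f"{num_docs}x{num_terms} matrix, density={density:.3f}% ({num_nonzero}/{num_cells})"
print(summary)
```

Note that the count of 2 for "system" in the fourth document still occupies a single cell, which is why 29 corpus positions produce 28 non-zero entries.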


In [13]:
from smart_open import open  # for transparently opening remote files


class MyCorpus:
    def __iter__(self):
        # reopen the file on every pass and close it when the pass ends
        with open('corpus.txt') as fin:
            for line in fin:
                # assume there's one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())

In [14]:
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x00000215A56A61D0>


In [15]:
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
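One reason to wrap the stream in a class with `__iter__` rather than use a bare generator is that an algorithm may need to pass over the corpus more than once. The sketch below shows the difference in plain Python. `LineCorpus`, `line_generator`, and the temporary two-line file are hypothetical stand-ins, and tokens are yielded as lists rather than bag-of-words vectors to keep the example free of dependencies.

```python
import os
import tempfile

class LineCorpus:
    """Yield one whitespace-tokenized document per line, reopening the file on every pass."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as fin:
            for line in fin:
                yield line.lower().split()

def line_generator(path):
    """Same stream as LineCorpus, but as a one-shot generator."""
    with open(path, encoding='utf-8') as fin:
        for line in fin:
            yield line.lower().split()

# hypothetical two-line corpus file, written only for this demonstration
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("Graph minors A survey\nThe intersection graph of paths in trees\n")
    path = tmp.name

corpus_obj = LineCorpus(path)
first_pass = list(corpus_obj)
second_pass = list(corpus_obj)  # iterating again reopens the file

gen = line_generator(path)
gen_first = list(gen)
gen_second = list(gen)          # the generator is already exhausted

os.remove(path)
print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```

The class yields both documents on each pass, while the generator yields nothing the second time.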


In [17]:
# collect statistics about all tokens
with open('corpus.txt') as fin:
    dictionary = corpora.Dictionary(line.lower().split() for line in fin)
# find the ids of stop words and of words that appear in only one document
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and single-document words
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)

2023-03-29 19:10:51,271 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2023-03-29 19:10:51,272 : INFO : built Dictionary<42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...> from 9 documents (total 69 corpus positions)
2023-03-29 19:10:51,273 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...> from 9 documents (total 69 corpus positions)", 'datetime': '2023-03-29T19:10:51.273096', 'gensim': '4.3.1', 'python': '3.10.4 | packaged by conda-forge | (main, Mar 30 2022, 08:38:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19044-SP0', 'event': 'created'}


Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>
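Both approaches end with 12 tokens here, but they do not filter on the same quantity. The earlier `defaultdict` loop counts total occurrences, while `dictionary.dfs` holds document frequencies (how many documents contain each token). A toy sketch of the difference follows; the two-document corpus is made up for illustration.

```python
from collections import Counter

# hypothetical toy corpus: "apple" occurs twice, but only in one document
toy_docs = [
    "apple apple banana".split(),
    "banana cherry".split(),
]

# total occurrences across the corpus (what the defaultdict loop counts)
total_freq = Counter(tok for doc in toy_docs for tok in doc)
# number of documents containing each token (the quantity Dictionary.dfs tracks, there keyed by id)
doc_freq = Counter(tok for doc in toy_docs for tok in set(doc))

kept_by_total = sorted(tok for tok, n in total_freq.items() if n > 1)
kept_by_docfreq = sorted(tok for tok, n in doc_freq.items() if n > 1)
print(kept_by_total)
print(kept_by_docfreq)
```

A token repeated inside a single document survives the total-count filter but not the document-frequency filter, so the two dictionaries can diverge on other corpora.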


In [1]:
!jupyter nbconvert --to html "Gensim Code Exploration 2.ipynb"

[NbConvertApp] Converting notebook Gensim Code Exploration 2.ipynb to html
[NbConvertApp] Writing 606353 bytes to Gensim Code Exploration 2.html
