source: https://ner.pythonhumanities.com/03_05_generating_custom_word_vectors.html 


10. Generating Custom Word Vectors with Gensim


Dr. W.J.B. Mattingly, Smithsonian Data Science Lab and United States Holocaust Memorial Museum, January 2021


Adapting the method for the Corridos Project: 1900 RGV Heroes

In [24]:
corpus_path = "../Corridos/corrido corpus/ElCorridodeGregorioCortez_X.txt"
# Read the whole corrido into a single string.
with open(corpus_path, "r", encoding="utf-8") as f:
    corpus = f.read()
    #print(corpus)

In [25]:
stopwords = ["i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves",
             "he","him","his","himself","she","her","hers","herself","it","its","itself","they","them","their",
             "theirs","themselves","what","which","who","whom","this","that","these","those","am","is","are","was",
             "were","be","been","being","have","has","had","having","do","does","did","doing","a","an","the","and",
             "but","if","or","because","as","until","while","of","at","by","for","with","about","against","between",
             "into","through","during","before","after","above","below","to","from","up","down","in","out","on","off",
             "over","under","again","further","then","once","here","there","when","where","why","how","all","any","both",
             "each","few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very",
             "s","t","can","will","just","don","should","now"
            ]
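# Lowercase the text and split on whitespace. Punctuation stays attached to
# tokens, so a stopword written as "him." or "\"i" is not filtered out here.
corpus = corpus.lower()
words = corpus.split()

# Keep only the tokens that are not in the stopword list.
new_corpus = []
for word in words:
    if word not in stopwords:
        new_corpus.append(word)

corpus = " ".join(new_corpus)
print(corpus)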

country karnes, look happened; major sheriff died, leaving román badly wounded. must two afternoon people arrived; said one another, "it known killed him." went around asking questions, half hour afterward, found wrongdoer gregorio cortez. outlawed cortez, throughout whole state; let taken, dead alive; killed several men. said gregorio cortez, pistol hand, "i don’t regret killed him; regret brother’s death." said gregorio cortez, soul aflame, "i don’t regret killed him; man must defend himself." americans coming, whiter dove, fear cortez pistol. americans said, said fearfully, "come, let us follow trail; wrongdoer cortez." set bloodhounds him, could follow trail, trying overtake cortez like following star.. struck gonzales without showing fear, "follow me, cowardly rangers, gregorio cortez." belmont went ranch, succeeded surrounding him, quite three hundred, jumped corral. jumped corral, according hear, got gunfight, killed another sheriff, said gregorio cortez, pistol hand, "don’t run
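
The output above still contains stopwords such as "i", "it" and "him", because the tokens carry quotation marks or periods and so never match the list exactly. One possible fix is sketched below: compare each token with its edge punctuation stripped. The helper name `remove_stopwords` and the sample sentence are assumptions made for illustration and are not part of the original notebook.

```python
import string

# Hypothetical helper, not part of the original notebook.
def remove_stopwords(text, stopwords):
    """Drop stopwords, comparing each token with edge punctuation removed."""
    # Curly quotes are added because string.punctuation only covers ASCII.
    edge_chars = string.punctuation + "“”‘’"
    stop_set = set(stopwords)
    kept = []
    for token in text.lower().split():
        # Strip punctuation from both ends only, so "don’t" stays intact.
        bare = token.strip(edge_chars)
        if bare and bare not in stop_set:
            kept.append(token)
    return " ".join(kept)

sample = 'Said Gregorio Cortez, "I regret that it killed him."'
print(remove_stopwords(sample, ["i", "that", "it", "him"]))
```

Note the trade-off: dropping a token like `him."` also drops the sentence-final period, which spaCy relies on for sentence splitting. It may be safer to apply a filter like this after the text has been split into sentences.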

In [26]:
import spacy
import string

nlp = spacy.load("en_core_web_lg")
doc = nlp(corpus)

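# Split the text into sentences with spaCy, strip ASCII punctuation, and
# tokenize each sentence on whitespace. Word2Vec expects a list of token lists.
# string.punctuation is ASCII only, so curly apostrophes (don’t) are kept.
sentences = []
for sent in doc.sents:
    sentence = sent.text.translate(str.maketrans('', '', string.punctuation))
    words = sentence.split()
    sentences.append(words)
print(sentences)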

[['country', 'karnes', 'look', 'happened', 'major', 'sheriff', 'died', 'leaving', 'román', 'badly', 'wounded'], ['must', 'two', 'afternoon', 'people', 'arrived', 'said', 'one', 'another', 'it', 'known', 'killed', 'him', 'went', 'around', 'asking', 'questions', 'half', 'hour', 'afterward', 'found', 'wrongdoer', 'gregorio', 'cortez'], ['outlawed', 'cortez', 'throughout', 'whole', 'state', 'let', 'taken', 'dead', 'alive', 'killed', 'several', 'men'], ['said', 'gregorio', 'cortez', 'pistol', 'hand', 'i', 'don’t', 'regret', 'killed', 'him', 'regret', 'brother’s', 'death'], ['said', 'gregorio', 'cortez', 'soul', 'aflame', 'i', 'don’t', 'regret', 'killed', 'him', 'man', 'must', 'defend', 'himself'], ['americans', 'coming', 'whiter', 'dove', 'fear', 'cortez', 'pistol'], ['americans', 'said', 'said', 'fearfully', 'come', 'let', 'us', 'follow', 'trail', 'wrongdoer', 'cortez'], ['set', 'bloodhounds', 'him', 'could', 'follow', 'trail', 'trying', 'overtake', 'cortez', 'like', 'following', 'star', '

In [27]:
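import os
from collections import defaultdict

from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import Word2Vec


def create_wordvecs(corpus, model_name):
    """Train Word2Vec on a list of tokenized sentences and save the vectors as text."""
    print(len(corpus))

    # Learn common bigrams. min_count=30 ignores words and bigrams counted fewer
    # than 30 times, so a corpus this small is unlikely to produce any phrases.
    phrases = Phrases(corpus, min_count=30, progress_per=10000)
    print("Made Phrases")

    # Phraser is a lighter, frozen version of the trained Phrases model.
    bigram = Phraser(phrases)
    print("Made Bigrams")

    # Apply the bigram model to every sentence.
    sentences = bigram[corpus]
    print("Found sentences")

    # Count how often each token occurs.
    word_freq = defaultdict(int)
    for sent in sentences:
        for i in sent:
            word_freq[i] += 1

    print(len(word_freq))

    print("Training model now...")
    w2v_model = Word2Vec(min_count=1,
                         window=2,
                         vector_size=10,
                         sample=6e-5,
                         alpha=0.03,
                         min_alpha=0.0007,
                         negative=20)
    w2v_model.build_vocab(sentences, progress_per=10000)
    w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

    # Make sure the output folder exists before saving.
    os.makedirs("data", exist_ok=True)
    w2v_model.wv.save_word2vec_format(f"data/{model_name}.txt")

create_wordvecs(sentences, "word_vecs")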

26
Made Phrases
Made Bigrams
Found sentences
182
Training model now...
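
With only 26 sentences, it is worth checking whether any word pair comes close to the `min_count=30` used for `Phrases` above. The sketch below counts adjacent word pairs with the standard library. It is an inspection aid only: the helper name `frequent_bigrams` and the toy sentences are assumptions, and raw counts are not the same as gensim's phrase scoring, which also applies a score threshold.

```python
from collections import Counter

# Hypothetical helper for inspection only; it is not how gensim scores phrases.
def frequent_bigrams(sentences, min_count):
    """Return adjacent word pairs that occur at least min_count times."""
    counts = Counter()
    for sent in sentences:
        # zip pairs each token with the one that follows it.
        counts.update(zip(sent, sent[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy data modeled on the sentences printed earlier.
toy = [
    ["said", "gregorio", "cortez", "pistol", "hand"],
    ["said", "gregorio", "cortez", "soul", "aflame"],
    ["found", "wrongdoer", "gregorio", "cortez"],
]
print(frequent_bigrams(toy, 30))
print(frequent_bigrams(toy, 3))
```

Running the same function on the real `sentences` list with a few different thresholds gives a rough idea of what `min_count` value would let a pair like "gregorio cortez" be considered at all.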


In [28]:
with open("data/word_vecs.txt", "r", encoding="utf-8") as f:
    data = f.readlines()
    # The first line is the header: vocabulary size and vector dimensions.
    print(data[0])

182 10



In [29]:
print(data[1])

cortez -0.0031606017 -8.9175715e-05 0.07099438 0.090066426 -0.08427642 -0.07545489 0.07783937 0.10610013 -0.057151973 -0.059844956
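
The two printed lines show the plain-text word2vec layout: a header with the vocabulary size and the number of dimensions, then one word per line followed by its values. As a minimal sketch of how that layout can be read by hand, the code below parses such lines and compares two vectors by cosine similarity. The helper names and the three-dimensional sample vectors are made up for illustration; in practice gensim's `KeyedVectors.load_word2vec_format` reads this format directly.

```python
import math

# Hypothetical helpers for reading the plain-text word2vec format by hand.
def parse_word2vec_text(lines):
    """Return (vectors, dims) from the lines of a word2vec text file."""
    header = lines[0].split()
    dims = int(header[1])
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dims

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up three-dimensional vectors, used only to show the format.
sample = ["2 3\n", "cortez 1.0 0.0 0.0\n", "sheriff 0.0 1.0 0.0\n"]
vectors, dims = parse_word2vec_text(sample)
print(dims, sorted(vectors))
print(cosine(vectors["cortez"], vectors["sheriff"]))
```

Passing the `data` list from the cell above in place of `sample` should give a dictionary keyed by word. With a corpus this small and only 10 dimensions, any similarity scores are best treated as a smoke test rather than a meaningful result.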



11. Loading Custom Word Vectors into a spaCy Model

In [39]:
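model_name = "corridos_model_test"
word_vectors = "data/word_vecs.txt"

def load_word_vectors(model_name, word_vectors):
    import subprocess
    import sys
    print(model_name)
    #print(word_vectors)
    # Call the spaCy CLI with the same Python interpreter that runs this notebook.
    # Note: "init-model" is the spaCy v2 command name.
    result = subprocess.run([sys.executable,
                    "-m",
                    "spacy",
                    "init-model",
                    "en",
                    model_name,
                    "--vectors-loc",
                    word_vectors
                        ]
                    )
    # Only report success if the command exited cleanly.
    if result.returncode == 0:
        print(f"New spaCy model created with word vectors. File: {model_name}")
    else:
        print(f"spaCy exited with code {result.returncode}; no model was created.")
load_word_vectors(model_name, word_vectors)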

corridos_model_test
New spaCy model created with word vectors. File: corridos_model_test
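
The cell above uses the `init-model` command from spaCy v2. On spaCy v3 the same step is done with `spacy init vectors`, which takes the language, the vectors file and the output directory as positional arguments. The sketch below only builds that argument list so it can be inspected before running. The helper name `build_init_vectors_command` is an assumption for illustration, and the command should be checked against the installed spaCy version.

```python
import sys

# Hypothetical helper: builds the argument list but does not run it.
def build_init_vectors_command(lang, vectors_path, output_dir):
    """Build the spaCy v3 CLI call: spacy init vectors <lang> <vectors> <output_dir>."""
    return [sys.executable, "-m", "spacy", "init", "vectors",
            lang, vectors_path, output_dir]

cmd = build_init_vectors_command("en", "data/word_vecs.txt", "corridos_model_test")
print(" ".join(cmd[1:]))
```

The list can then be passed to `subprocess.run(cmd, check=True)`, where `check=True` raises an error if the command fails instead of silently continuing.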
