# Master Thesis on the Semantics of (made-up) Names

* Author: Aron Joosse
* Supervisor: Giovanni Cassani
* Institution: Tilburg University

Possible source of inspiration: https://github.com/Masetto96/BA-Thesis-form-meaning-mapping/blob/master/form_meaning_mapping.ipynb

# Library Imports

In [1]:
!pip install fasttext --progress-bar off
!pip install -U spacy --progress-bar off
!python -m spacy download en_core_web_sm
import fasttext
import spacy
import numpy as np
import pandas as pd
import re
import pickle

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Data Import

In [2]:
## Mount Google Drive so the data files can be read
from google.colab import drive
drive.mount("/content/drive", force_remount=True) 

Mounted at /content/drive


In [3]:
## Getting the list of madeup names:

ratings_csv = pd.read_csv("drive/MyDrive/Thesis/Data/giovanni_email_data/avgRatings_annotated.csv",
                          usecols = ["name", "name_type"])

ratings_csv.head(10)

## Select the made-up names; the same filter works for the "talking" and "real" name types
madeup_names = ratings_csv.loc[ratings_csv["name_type"] == "madeup", "name"].astype(str).tolist()

madeup_names_lower = [name.lower() for name in madeup_names]

print(madeup_names[:5])
print(len(madeup_names))
print(madeup_names_lower[:5])
print(len(madeup_names_lower))

['Alastor', 'Alecto', 'Amabala', 'Araminta', 'Arcturus']
60
['alastor', 'alecto', 'amabala', 'araminta', 'arcturus']
60
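
The comment in the cell above notes that the same selection works for the other name types. A minimal sketch of collecting all types in one pass follows. It is stdlib-only, `group_names_by_type` is a hypothetical helper that is not part of the notebook, and the rows are toy data (the `real` and `talking` entries are invented). It also assumes that `talking` and `real` are the exact labels used in the `name_type` column.

```python
from collections import defaultdict


def group_names_by_type(rows):
    """Map each name_type to the lowercased names of that type, in input order."""
    grouped = defaultdict(list)
    for name, name_type in rows:
        grouped[name_type].append(str(name).lower())
    return dict(grouped)


# Toy rows standing in for the (name, name_type) columns of the ratings file.
toy_rows = [
    ("Alastor", "madeup"),
    ("Alecto", "madeup"),
    ("Emma", "real"),       # invented example
    ("Bumble", "talking"),  # invented example
]

names_by_type = group_names_by_type(toy_rows)
print(names_by_type["madeup"])
```

With the DataFrame loaded above, the helper could be fed `zip(ratings_csv["name"], ratings_csv["name_type"])` instead of the toy rows.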


## COCA

In [4]:
path = "drive/My Drive/Thesis/Data/CoCA/Text/"
unclean_path = path + "texts_combined/all_texts_combined.txt"
with open(unclean_path) as f:  # close the file handle once the text is read
    unclean_corpus = f.read()


In [5]:
print(len(unclean_corpus))
print(unclean_corpus[:100])

2977527143
@@4170367 Headnote # A puzzle has long pervaded the criminal law : why are two offenders who commit 


## Names

# Preprocessing


## Cleaning Corpus

In [6]:
## Loading the English spaCy pipeline and taking gendered third-person pronouns
## off the stop-word list, so that they survive the stop-word filtering below

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 10000000000

kept_pronouns = ["him", "her", "hers", "his", "he", "she", "himself", "herself"]
for pronoun in kept_pronouns:
    nlp.Defaults.stop_words.discard(pronoun)  # discard() is safe when the cell is re-run
    nlp.vocab[pronoun].is_stop = False        # also clear the flag on the cached lexeme

In [None]:
def clean_corpus_unsentenced(data):
    # Tokenization (only the tokenizer and the sentence segmenter are needed)
    with nlp.select_pipes(disable=["lemmatizer", "tok2vec", "tagger", "parser"]):
      nlp.enable_pipe("senter")
      doc = nlp(data)
    print(doc[:150])

    madeup_set = set(madeup_names_lower)  # set lookup instead of scanning a list per token

    # Keep lowercased alphabetic tokens that are not all-caps, stop words, or made-up names
    doc_filtered = [
        token.lower_
        for token in doc
        if token.is_alpha
        and not token.is_upper
        and not token.is_stop
        and token.lower_ not in madeup_set
    ]

    doc_filtered = " ".join(doc_filtered)

    print(doc_filtered[:500])

    # Remove words with freq < XX

    return doc_filtered

cleaned_corpus = clean_corpus_unsentenced(unclean_corpus)  # use unclean_corpus[:1000000] for a quick test

ValueError: ignored
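
Passing the full corpus (about 3 billion characters, per the length printed earlier) to a single `nlp(...)` call builds one enormous `Doc`. One alternative is to split the text into smaller pieces and process them one at a time. The sketch below assumes that every text in the combined file starts with an `@@` marker followed by a numeric ID, as in the excerpt printed earlier. `iter_texts` is a hypothetical helper and the sample string is invented.

```python
import re

# Assumption: each text begins with "@@" plus a numeric ID.
TEXT_START = re.compile(r"@@\d+")


def iter_texts(corpus):
    """Lazily yield (text_id, body) pairs, one per @@-delimited text."""
    previous = None
    for match in TEXT_START.finditer(corpus):
        if previous is not None:
            # The previous text runs up to the start of the current marker.
            yield previous.group()[2:], corpus[previous.end():match.start()].strip()
        previous = match
    if previous is not None:
        # The last text runs to the end of the corpus.
        yield previous.group()[2:], corpus[previous.end():].strip()


sample = "@@101 First text . It has two sentences . @@102 Second text ."
texts = list(iter_texts(sample))
print(texts)
```

The bodies could then be streamed through `nlp.pipe(...)`, which accepts an iterable of texts and processes them in batches. Whether this avoids the `ValueError` shown above has not been verified here.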

In [None]:
def clean_corpus_sentenced(data):
    # Tokenization
    doc = nlp(data)
    print(doc[:50])
    print(list(doc.sents)[:15])

    madeup_set = set(madeup_names_lower)  # set lookup instead of scanning a list per token
    doc_filtered = []

    sentence = []

    for token in doc:
      # A new sentence starts: store the tokens collected for the previous one
      if token.is_sent_start:
        doc_filtered.append(" ".join(sentence))
        sentence = []

      if token.is_upper or token.is_stop:
        continue
      elif token.lower_ in madeup_set:
        continue
      elif token.is_alpha:
        sentence.append(token.lower_)

    # Store the final sentence, which no later sentence start would flush
    doc_filtered.append(" ".join(sentence))

    doc_filtered = list(filter(lambda x: x != "", doc_filtered))
    print(doc_filtered[:15])

    # Remove words with freq < XX

    ## TODO: Because fastText uses subword information, it is probably best to filter on lemma
    ## frequency rather than word-form frequency: it does not matter if e.g. 'running' occurs only
    ## once as long as 'run' and '-ing' both occur often. One option is to count the lemmas and
    ## remove words whose lemma occurs fewer than 3 times. The right threshold is still undecided.

    ## TODO: Compare stop words removed vs. stop words kept, since it is unclear whether removing
    ## them actually improves performance for this specific analysis.

    ## --> Check what other papers did for the minimum word frequency and for stop words.

    return doc_filtered

sentences = clean_corpus_sentenced(unclean_corpus[:1000000])

@@4170367 Headnote # A puzzle has long pervaded the criminal law : why are two offenders who commit the same criminal act punished differently when one of them , due to circumstances beyond her control , causes more harm than the other ? This tradition of result-based differential
[@@4170367, Headnote # A puzzle has long pervaded the criminal law : why are two offenders who commit the same criminal act punished differently when one of them , due to circumstances beyond her control , causes more harm than the other ?, This tradition of result-based differential punishment-the practice of varying offenders ' punishment based on whether or not they cause specific " statutory harms " -has long stood as an intractable problem for scholars and jurists alike ., # This Article proposes a solution to this long-standing conceptual problem ., We begin by introducing a dichotomy between two broad and exhaustive categories of ideological justifications for punishing criminal offenders ., The first 
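
The note on lemma-based frequency filtering in the cell above can be sketched as follows. This is a minimal, stdlib-only illustration under stated assumptions: the `lemma_of` mapping is a hand-written stand-in for a real lemmatizer (for example spaCy's `token.lemma_`), `filter_by_lemma_frequency` is a hypothetical helper, and `min_count=3` is only the tentative threshold mentioned in the note.

```python
from collections import Counter


def filter_by_lemma_frequency(sentences, lemma_of, min_count=3):
    """Drop words whose lemma occurs fewer than min_count times in total."""
    # Words missing from the lookup are treated as their own lemma.
    lemma_counts = Counter(
        lemma_of.get(word, word) for sentence in sentences for word in sentence
    )
    return [
        [word for word in sentence if lemma_counts[lemma_of.get(word, word)] >= min_count]
        for sentence in sentences
    ]


# Toy data: tokenized sentences plus a stand-in lemma lookup.
toy_sentences = [
    ["she", "runs", "daily"],
    ["he", "ran", "home"],
    ["running", "helps"],
]
toy_lemmas = {"runs": "run", "ran": "run", "running": "run", "helps": "help"}

filtered = filter_by_lemma_frequency(toy_sentences, toy_lemmas, min_count=3)
print(filtered)
```

In the toy data each form of "run" occurs only once, yet all three are kept because their shared lemma reaches the threshold, while every other word is dropped.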

In [None]:
drive.flush_and_unmount()
print('All changes made in this colab session should now be visible in Drive.')

All changes made in this colab session should now be visible in Drive.


## Training fastText and Validating on Word Embeddings Benchmark

In [None]:
# Skipgram model :
#model = fasttext.train_unsupervised('data.txt', model='skipgram')

#model.save_model("model_filename.bin")

#model = fasttext.load_model("model_filename.bin")

#model.get_nearest_neighbors('asparagus')

# In a similar spirit, one can play around with word analogies. For example, we can see whether the model can guess what is to France what Berlin is to Germany.
# This is done with the analogies functionality, which takes a word triplet (like Germany Berlin France) and outputs the analogy:
#model.get_analogies("berlin", "germany", "france")
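
The analogy query above rests on vector arithmetic: a query vector is formed as A - B + C, and the vocabulary word closest to it by cosine similarity is returned. The sketch below illustrates that idea with invented two-dimensional toy vectors and a hypothetical `analogy` helper. It does not call fastText, and trained embeddings would have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = (word for word in vectors if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(query, vectors[word]))


# Invented toy embeddings: the first coordinate separates the countries,
# the second marks capital cities.
toy_vectors = {
    "germany": [1.0, 0.0],
    "berlin": [1.0, 1.0],
    "france": [2.0, 0.0],
    "paris": [2.0, 1.0],
    "asparagus": [-1.0, 0.2],
}

result = analogy(toy_vectors, "berlin", "germany", "france")
print(result)
```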