<a href="https://colab.research.google.com/github/ValleSell/cs224u/blob/master/germanWord2vecEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [GermanWordEmbeddings](https://devmount.github.io/GermanWordEmbeddings/) 
See the linked project page for more information on training and text preprocessing.

In [None]:
%matplotlib inline
import gensim
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

Download the German word2vec model if it has not been downloaded already.

In [None]:
! test -e german.model || wget https://cloud.devmount.de/d2bc5672c523b086/german.model
# Do the models vary by (natural) language? And if so, why? Window size?
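
Outside of Colab, where `wget` and `test` may not be available, the same "download only if missing" logic can be sketched with the Python standard library. The helper name `download_if_missing` and the `.part` temporary suffix are assumptions made for this illustration; the URL and file name are the ones used in the shell cell.

```python
import os
import urllib.request

# URL and local file name taken from the shell cell above.
MODEL_URL = "https://cloud.devmount.de/d2bc5672c523b086/german.model"
MODEL_PATH = "german.model"


def download_if_missing(url, path):
    """Download `url` to `path` unless `path` already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper, not part of GermanWordEmbeddings or gensim.)
    """
    if os.path.exists(path):
        return False
    # Download under a temporary name first, so an interrupted transfer
    # does not leave a truncated file that later looks "already downloaded".
    tmp_path = path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, path)
    return True


# Usage (commented out here because the model file is large):
# download_if_missing(MODEL_URL, MODEL_PATH)
```

Writing to a temporary name and renaming afterwards is a design choice of this sketch; the shell one-liner does not do this.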

In [None]:
# load trained model
model = gensim.models.KeyedVectors.load_word2vec_format("german.model", binary=True)

# Compute nearest neighbours

In [None]:
for w, sim in model.most_similar("mies", topn=10):
	print(w, sim)

In [None]:
for w, sim in model.most_similar("frau", topn=10):
	print(w, sim)
# In sexist depictions, is "Frau" written in lowercase, or how should this be interpreted?

In [None]:
for w, sim in model.most_similar("Frau", topn=10):
	print(w, sim)

In [None]:
for w, sim in model.most_similar("Mann", negative=["jung"], topn=10):
    print(w, sim)

print("---\n")
for w, sim in model.most_similar("Mann", negative=["Frau"], topn=10):
    print(w, sim)

#??

In [None]:
for w, sim in model.most_similar("Schweiz", topn=10):
	print(w, sim)

In [None]:
help(model.most_similar)

# Analogy
"Frau" is to "Koenigin" as "Mann" is to ???

In vector arithmetic: "Mann" + "Koenigin" - "Frau"
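
To make the arithmetic concrete, here is a minimal sketch of the analogy computation on toy data. The 3-dimensional vectors are invented for illustration and are not taken from the German model; the helper `analogy` is hypothetical. It follows, in simplified form, the approach gensim documents for `most_similar`: combine the unit-normalized input vectors, then rank the remaining words by cosine similarity to the result.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "Mann":     [1.0, 0.0, 0.1],
    "Frau":     [0.0, 1.0, 0.1],
    "Koenig":   [1.0, 0.0, 0.9],
    "Koenigin": [0.0, 1.0, 0.9],
    "Haus":     [0.5, 0.5, 0.0],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The input words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(toy_vectors, positive=["Mann", "Koenigin"], negative=["Frau"])
print(result[0][0])  # Koenig
```

With real embeddings the top hit is not guaranteed to be the expected word; the cell below shows what the trained model actually returns.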

In [None]:
print(model.most_similar(positive=['Mann', 'Koenigin'], negative=['Frau']))
print("\n----\n")
model.most_similar(positive=['Mann', 'Hausfrau'], negative=['Frau'])

# Addition
Cultural biases end up in the embedding space... Try "Frau"+"Arbeit"...

In [None]:
print(model.most_similar(positive=['Frau', 'Arbeit']))
print ("\n----\n")
model.most_similar(positive=['Mann', 'Arbeit'])

In [None]:
print(model.most_similar(positive=['Popcorn', 'Filme']))
print ("\n----\n")
model.most_similar(positive=['Popcorn', 'Film'])

# Subtraction


In [None]:
model.most_similar(positive=['Python'],negative=['Software'])

In [None]:
model.most_similar(positive=['Python'],negative=['Tier'])

# Plot word vectors
##### See [visualize.py](https://github.com/devmount/GermanWordEmbeddings/blob/master/visualize.py) script from [GermanWordEmbeddings](https://devmount.github.io/GermanWordEmbeddings/)

The following code gives an example of how to reduce dimensionality of word vectors with PCA or t-SNE.
With two dimensions left, the words can be plotted as points in a two-dimensional graph.
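
As a rough illustration of what the PCA step does, the following sketch projects a few toy vectors onto their first two principal components using NumPy's SVD. The data and the helper name `pca_2d` are assumptions for this example. Unlike the scikit-learn call used in this notebook, it does not whiten the components, and it does not cover t-SNE.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components.

    Hypothetical helper for illustration; no whitening is applied.
    """
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Toy 4-dimensional "word vectors", invented for illustration only.
toy_vectors = [
    [2.0, 0.1, 0.0, 1.0],
    [1.8, 0.0, 0.2, 0.9],
    [0.1, 2.1, 0.0, 1.1],
    [0.0, 1.9, 0.1, 1.0],
    [1.0, 1.0, 3.0, 1.0],
]

points = pca_2d(toy_vectors)
print(points.shape)  # (5, 2)
```

Each row of `points` is the 2D position of one input vector and could be passed to `plt.scatter` in the same way the plotting function in this notebook does.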

You need [gensim](https://radimrehurek.com/gensim/install.html), [matplotlib](http://matplotlib.org/faq/installing_faq.html#how-to-install) and [scikit-learn](http://scikit-learn.org/dev/install.html) for this script to work.

The following function computes a PCA or t-SNE representation of the given words and displays a configured, styled plot.

In [None]:
def draw_words(model, words, pca=False, alternate=True, arrows=True, x1=3, x2=3, y1=3, y2=3, title=''):
    # get vectors for given words from model
    vectors = [model[word] for word in words]

    if pca:
        # use a separate name so the boolean `pca` flag is not overwritten
        pca_model = PCA(n_components=2, whiten=True)
        vectors2d = pca_model.fit_transform(vectors)
    else:
        tsne = TSNE(n_components=2, random_state=0)
        vectors2d = tsne.fit_transform(vectors)

    # draw image
    plt.figure(figsize=(6,6))
    if pca:
        plt.axis([x1, x2, y1, y2])

    first = True # color alternation to divide given groups
    for point, word in zip(vectors2d , words):
        # plot points
        plt.scatter(point[0], point[1], c='r' if first else 'g')
        # plot word annotations
        plt.annotate(
            word, 
            xy = (point[0], point[1]),
            xytext = (-7, -6) if first else (7, -6),
            textcoords = 'offset points',
            ha = 'right' if first else 'left',
            va = 'bottom',
            size = "x-large"
        )
        first = not first if alternate else first

    # draw arrows
    if arrows:
        for i in range(0, len(words)-1, 2):  # `xrange` does not exist in Python 3
            a = vectors2d[i][0] + 0.04
            b = vectors2d[i][1]
            c = vectors2d[i+1][0] - 0.04
            d = vectors2d[i+1][1]
            plt.arrow(a, b, c-a, d-b,
                shape='full',
                lw=0.1,
                edgecolor='#bbbbbb',
                facecolor='#bbbbbb',
                length_includes_head=True,
                head_width=0.08,
                width=0.01
            )

    # draw diagram title
    if title:
        plt.title(title)

    plt.tight_layout()
    plt.show()

Now that we have all the tools to process word vectors, we can apply them to the `word2vec` language model loaded earlier, which contains our high-dimensional word vectors.

With the model and the `draw_words()` function, a list of words can be plotted. When two word classes are given (as in the first three examples below), put them alternately in the list and set the `alternate` parameter of the function to `True`. This colors the two classes differently and improves the label positions; with `arrows=True`, an arrow is also drawn between the two words of each pair.
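
If the two word classes are kept in separate lists, the alternating order expected by `draw_words()` can be built with a small helper. The name `interleave` and the shortened example lists are assumptions for this sketch.

```python
def interleave(first_class, second_class):
    """Return [a1, b1, a2, b2, ...] from two equally long lists.

    Hypothetical helper, not part of GermanWordEmbeddings.
    """
    if len(first_class) != len(second_class):
        raise ValueError("both word classes must have the same length")
    return [word for pair in zip(first_class, second_class) for word in pair]


countries = ["Schweiz", "Deutschland", "Japan"]
currencies = ["Franken", "Euro", "Yen"]

words = interleave(countries, currencies)
print(words)
# ['Schweiz', 'Franken', 'Deutschland', 'Euro', 'Japan', 'Yen']
```

Raising an error on unequal lengths is a design choice of this sketch: `zip` would otherwise silently drop the unpaired words.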

In [None]:
# plot currencies
words = ["Schweiz","Franken","Deutschland","Euro","Grossbritannien","britische_Pfund","Japan","Yen","Russland","Rubel","USA","US-Dollar","Kroatien","Kuna"]
draw_words(model, words, pca=True, alternate=True, arrows=True, x1=-3, x2=3, y1=-2, y2=3, title=r'$PCA\ Visualisierung:\ W\ddot{a}hrung$')

In [None]:
# plot capitals
words  = ["Athen","Griechenland","Berlin","Deutschland","Ankara","Tuerkei","Bern","Schweiz","Hanoi","Vietnam","Lissabon","Portugal","Moskau","Russland","Stockholm","Schweden","Tokio","Japan","Washington","USA"]
draw_words(model, words, True, True, True, -3, 3, -2, 3, r'$PCA\ Visualisierung:\ Hauptstadt$')

In [None]:
# plot language
words = ["Deutschland","Deutsch","USA","Englisch","Frankreich","Franzoesisch","Griechenland","Griechisch","Norwegen","Norwegisch","Schweden","Schwedisch","Polen","Polnisch","Ungarn","Ungarisch"]
draw_words(model, words, True, True, True, -3, 3, -2, 3, r'$PCA\ Visualisierung:\ Sprache$')

The next example plots the words most closely related to a given word, using gensim's `most_similar()` method.

In [None]:
# plot related words to 'house'
matches = model.most_similar(positive=["Haus"], negative=[], topn=10)
words = [match[0] for match in matches]
draw_words(model, words, True, False, False, -3, 2, -2, 2, r'$PCA\ Visualisierung:\ Haus$')

Finally, an example showing how the embeddings capture the gender associated with given names.

In [None]:
# plot name
words = ["Alina","Aaron","Charlotte","Ben","Emily","Elias","Fiona","Felix","Johanna","Joel","Lara","Julian","Lea","Linus","Lina","Lukas","Mia","Mika","Sarah","Noah","Sophie","Simon"]
draw_words(model, words, True, True, False, -3, 3, -1.5, 2.5, r'$PCA\ Visualisierung:\ Namen\ nach\ Geschlecht$')