## This notebook contains our LDA approach to finding topics in Terms of Service (ToS) text: we build a dictionary and then group the text under those topics. We used this approach to cluster ToS text into groups related to copyright, privacy and termination.

In [25]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import re
import numpy as np
import pandas as pd

# Reading data

First, we'll create a data structure that holds the text of the different ToS documents. We'll use it to train our LDA model and extract the different topics.

The text will be cleaned (stop words removed, words stemmed, etc.) and stored as a list of paragraphs.

In [26]:
# Utility function to read a file (ToS) into a list of paragraphs.
def read_file_to_paragraphs(file_path):
    # The context manager closes the file even if reading fails.
    with open(file_path, 'r') as file:
        doc = file.read()
    # Some documents have a single line break inside what could be considered
    # the same paragraph, so two or more consecutive \n's are a better split.
    pars = re.split(r'\n\n+', doc)
    print('reading %s wich have %d paragraphs' % (file_path, len(pars)))
    return pars
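
To illustrate the splitting rule used above, here is a minimal sketch on a made-up string (the sample text is hypothetical and not taken from any of the ToS files). A single newline stays inside a paragraph, while two or more consecutive newlines start a new one.

```python
import re

# Hypothetical sample: one hard-wrapped paragraph followed by two more paragraphs.
sample = (
    "You agree to these terms\nby using the service."
    "\n\nWe may change them."
    "\n\n\n\nContact us."
)

# Same pattern as in read_file_to_paragraphs: split on 2+ consecutive newlines.
paragraphs = re.split(r'\n\n+', sample)

for i, par in enumerate(paragraphs):
    print(i, repr(par))
```

Note that a "blank" line containing spaces would not match this pattern, so such files may yield fewer, longer paragraphs.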

Reading all the files...

In [27]:
pars = read_file_to_paragraphs('./data/twitter_tos.txt')

reading ./data/twitter_tos.txt wich have 80 paragraphs


In [28]:
pars.extend(read_file_to_paragraphs('./data/facebook_tos.txt'))
pars.extend(read_file_to_paragraphs('./data/github_tos.txt'))
pars.extend(read_file_to_paragraphs('./data/google_privacy_tos.txt'))
pars.extend(read_file_to_paragraphs('./data/google_tos.txt'))
pars.extend(read_file_to_paragraphs('./data/snaptchat_tos.txt'))
pars.extend(read_file_to_paragraphs('./data/squarespace_tos.txt'))
pars.extend(read_file_to_paragraphs('./data/youtube_tos.txt'))

reading ./data/facebook_tos.txt wich have 35 paragraphs
reading ./data/github_tos.txt wich have 60 paragraphs
reading ./data/google_privacy_tos.txt wich have 71 paragraphs
reading ./data/google_tos.txt wich have 52 paragraphs
reading ./data/snaptchat_tos.txt wich have 83 paragraphs
reading ./data/squarespace_tos.txt wich have 153 paragraphs
reading ./data/youtube_tos.txt wich have 13 paragraphs


In [29]:
len(pars)

547

We'll train our model with 547 paragraphs.

In [30]:
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()

# Utility function that tokenizes, removes stop words from and stems a paragraph.
def tokenize_and_stem(text, tokenizer, stemmer, stop_words):
    return [stemmer.stem(word) for word in tokenizer.tokenize(text.lower()) if word not in stop_words]

In [31]:
norm_texts = [tokenize_and_stem(par, tokenizer, p_stemmer, en_stop) for par in pars]

An example of the normalized text:

In [32]:
norm_texts[:2]

[['skip',
  'main',
  'content',
  'twitter',
  'languag',
  'english',
  'sign',
  'download',
  'thetwitteruseragr',
  'pdf'],
 ['live',
  'unit',
  'state',
  'twitter',
  'user',
  'agreement',
  'compris',
  'term',
  'servic',
  'privaci',
  'polici',
  'twitter',
  'rule',
  'incorpor',
  'polici']]
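
The tokenizing and stop-word steps can be sketched with the standard library alone. This is only an approximation of the cell above: `re.findall` stands in for `RegexpTokenizer(r'\w+')`, the stop-word set is a tiny hypothetical one rather than `get_stop_words('en')`, and stemming is left out (which is why the output below shows `privacy` rather than `privaci`).

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses get_stop_words('en').
toy_stop_words = {"the", "of", "and", "to", "a", "is", "are"}

def tokenize_and_filter(text, stop_words):
    """Lowercase, split into runs of word characters, drop stop words (no stemming)."""
    tokens = re.findall(r'\w+', text.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(tokenize_and_filter("The Terms of Service and the Privacy Policy.", toy_stop_words))
# The \w+ pattern also splits possessives into two tokens.
print(tokenize_and_filter("Medium's sites", toy_stop_words))
```

The second call shows that `\w+` breaks `Medium's` into `medium` and `s`, which is a likely source of one-letter tokens such as `s` showing up among topic words.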

# Create Dictionary object

To train an LDA model, we first need to map words to numeric ids with a _Dictionary_ object.

In [33]:
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(norm_texts)
type(dictionary)

gensim.corpora.dictionary.Dictionary

In [34]:
# saving the dictionary object to use later (in the web app).
dictionary.save('lda_dictionary')

# Create BOW object

We'll also need a bag-of-words representation of our text to train the LDA model.

In [35]:
# convert tokenized documents into a document-term matrix
bows = [dictionary.doc2bow(text) for text in norm_texts]
len(bows)

547

An example of what this bag-of-words structure looks like (for the first paragraph):

In [36]:
bows[0]

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1)]
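
To make the `(id, count)` pairs above more concrete, here is a pure-Python sketch of the idea behind the id mapping and the bag-of-words conversion. It is a simplified stand-in, not gensim's implementation: the helper names `build_token_ids` and `to_bow` are hypothetical, and the ids gensim assigns may differ. Like `doc2bow` with its default settings, it ignores words that are not in the mapping, which matters later when the model is applied to a document it was not trained on.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy, already-stemmed documents (hypothetical).
docs = [["twitter", "term", "servic"], ["privaci", "polici", "twitter", "polici"]]
token2id = build_token_ids(docs)

print(to_bow(docs[1], token2id))               # 'polici' appears twice
print(to_bow(["polici", "unseen"], token2id))  # 'unseen' is dropped
```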

# Train the LDA model

Now we're ready to train our LDA model. After reading through the documents and trying different parameters, we concluded that 10 is a good number of topics for our model. We're classifying our text into only 5 categories, but with this setting we're saying that the complete documents contain (on average) 10 different topics. When we tried 15 or 20, the topics seemed to repeat.

In [37]:
ldamodel = gensim.models.ldamodel.LdaModel(bows, num_topics=10, id2word = dictionary, passes=20)
ldamodel.print_topics()

[(0,
  '0.059*"servic" + 0.027*"content" + 0.024*"use" + 0.020*"term" + 0.018*"will" + 0.015*"parti" + 0.014*"third" + 0.012*"damag" + 0.012*"includ" + 0.011*"access"'),
 (1,
  '0.037*"account" + 0.036*"twitter" + 0.021*"servic" + 0.018*"use" + 0.017*"term" + 0.015*"com" + 0.012*"may" + 0.010*"inform" + 0.010*"provid" + 0.010*"agreement"'),
 (2,
  '0.051*"privaci" + 0.049*"polici" + 0.034*"inform" + 0.033*"use" + 0.029*"servic" + 0.025*"term" + 0.023*"payment" + 0.021*"user" + 0.019*"end" + 0.019*"site"'),
 (3,
  '0.072*"inform" + 0.046*"use" + 0.039*"googl" + 0.026*"servic" + 0.023*"share" + 0.017*"may" + 0.017*"content" + 0.017*"facebook" + 0.016*"advertis" + 0.016*"collect"'),
 (4,
  '0.037*"servic" + 0.019*"chang" + 0.017*"domain" + 0.017*"renew" + 0.017*"may" + 0.016*"13" + 0.016*"term" + 0.016*"copyright" + 0.015*"fee" + 0.014*"notic"'),
 (5,
  '0.023*"right" + 0.021*"account" + 0.021*"will" + 0.017*"s" + 0.015*"may" + 0.015*"person" + 0.014*"state" + 0.014*"user" + 0.013*"inform

We'll save the model for later use (in our web app).

In [38]:
ldamodel.save('lda_model')

# Running the model in a new doc

The topics above appear to make sense, but they are not labeled nicely for our use. For example, `(1, '0.057*"servic" + 0.036*"inform" + 0.036*"use" + 0.023*"privaci" + 0.022*"polici" + 0.015*"provid" + 0.013*"access" + 0.013*"term" + 0.012*"user" + 0.010*"collect"')` seems to be about privacy and the use of information. We'll need to take these topics and give them more readable labels.

In the following dictionary we define the topics we're interested in and which words belong to each of them.

In [39]:
# This topic list can be expanded with more topics and more words related to those topics.
topic_dic = {'privacy': ['privacy'], 'copyright': ['copyright'], 'content sharing/use': ['share'],
             'cancelation/termination': ['cancelation', 'termination'],
             'modification/pricing': ['modification', 'pricing'],
             'special': ['law', 'jurisdiction', 'governing']}

Now we need to use our trained model to analyze a new document. To do that, we apply the model to each paragraph, find its most relevant topic and label the paragraph accordingly. We'll build a list in which each element is itself a list holding the original paragraph (not cleaned) and its label; the cleaned words are used only to analyze the paragraph with LDA.
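
Finding the most relevant topic for one paragraph comes down to taking the pair with the highest probability. A minimal sketch, assuming the model returns a list of `(topic_id, probability)` pairs for a bag-of-words input; the numbers here are made up for illustration:

```python
# Hypothetical result of ldamodel[bow] for one paragraph:
# a list of (topic_id, probability) pairs.
lda_vector = [(0, 0.05), (3, 0.71), (7, 0.24)]

# Pick the pair with the largest probability (the second element of each pair).
topic_id, probability = max(lda_vector, key=lambda item: item[1])

print(topic_id, probability)
```

If two topics have nearly the same probability, `max` still returns only one of them, so each paragraph ends up associated with a single dominant topic.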

In [40]:
def create_topic_pars(pars, tokenizer, stemmer, stop_words, ldamodel, word_dictionary, topic_dictionary):
    norm_pars = [tokenize_and_stem(par, tokenizer, stemmer, stop_words) for par in pars]
    print('created normalized paragraphs object of length %d' % len(norm_pars))
    bows = [word_dictionary.doc2bow(text) for text in norm_pars]
    print('created bag-of-words object of length %d' % len(bows))
    topic_pars = []
    for idx, bow in enumerate(bows):
        lda_vector = ldamodel[bow]
        # Most relevant topic of the original LDA model (highest probability).
        top_topic_id = max(lda_vector, key=lambda item: item[1])[0]
        # We attach the original paragraph here, not the cleaned version that we used for LDA.
        topic_pars.append([ldamodel.print_topic(top_topic_id), pars[idx]])

    # Now we'll create a nicely labeled structure.
    tagged_pars = []
    for topic_name in topic_dictionary:
        topic_words = topic_dictionary[topic_name]
        # Unpack into new names so the `pars` argument is not shadowed.
        for topic, par in topic_pars:
            if len(par) > 50:  # skip paragraphs of 50 characters or fewer
                for word in topic_words:
                    # `topic` is the print_topic() string, so this is a substring match.
                    if stemmer.stem(word) in topic:
                        tagged_pars.append([par, topic_name])
                        break
    return tagged_pars
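
The labeling step in the second half of the function can be sketched in isolation. The keyword lists below are hypothetical and already stemmed (the function above stems them with `PorterStemmer`), and the topic string is a shortened example in the `print_topic()` format.

```python
# Shortened topic string in print_topic() format (illustrative).
topic_string = '0.051*"privaci" + 0.049*"polici" + 0.034*"inform" + 0.033*"use"'

# Hypothetical, already-stemmed keywords for a few of the labels.
stemmed_topic_words = {
    'privacy': ['privaci'],
    'copyright': ['copyright'],
    'content sharing/use': ['share'],
}

# A label applies when any of its keywords occurs in the topic string.
labels = [name for name, words in stemmed_topic_words.items()
          if any(word in topic_string for word in words)]
print(labels)

# Plain substring matching can over-match: 'law' is found inside '"lawyer"'.
loose_match = 'law' in '0.020*"lawyer"'
# Matching the quoted token is stricter.
strict_match = '"law"' in '0.020*"lawyer"'
print(loose_match, strict_match)
```

Because the match is done on the topic string, a label is only found if one of its stemmed keywords is among the top words printed for that topic. Short keywords can also match longer, unrelated words, which the quoted-token variant avoids.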

We'll use Medium's ToS to try our model (obviously this was not part of the training data):

In [41]:
new_pars = read_file_to_paragraphs('./data/medium_tos.txt')

reading ./data/medium_tos.txt wich have 32 paragraphs


In [42]:
topic_pars = create_topic_pars(new_pars, tokenizer, p_stemmer, en_stop, ldamodel, dictionary, topic_dic)

created normalized paragraphs object of length 32
created bag-of-words object of length 32


In [43]:
topic_pars

[['These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using Medium, you agree to these Terms. If you don’t agree to any of the Terms, you can’t use Medium. We can change these Terms at any time. We keep a historical record of all changes to our Terms on GitHub. If a change is material, we’ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don’t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.',
  'copyright'],
 ['You own the rights to the content you create and post on Medium.',
  'copyright'],
 ['By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, including anything reasonably related to publishing it (like storing, displaying, reform

The results appear to make sense. To test the validity of our model, we'll show the results for several of these ToS documents to different people and collect their opinions.

---

This part is the same as above, but we used it to check whether we could run our model from the separately saved parts, so that we can use it in our web app.

# Running a saved model

In [44]:
dictionary2 = corpora.Dictionary.load('lda_dictionary')
type(dictionary2)

gensim.corpora.dictionary.Dictionary

In [45]:
ldamodel2 = gensim.models.ldamodel.LdaModel.load('lda_model')
type(ldamodel2)

gensim.models.ldamodel.LdaModel

In [46]:
topic_pars2 = create_topic_pars(new_pars, tokenizer, p_stemmer, en_stop, ldamodel2, dictionary2, topic_dic)

created normalized paragraphs object of length 32
created bag-of-words object of length 32


In [47]:
topic_pars2

[['These Terms of Service (“Terms”) are a contract between you and A Medium Corporation. They govern your use of Medium’s sites, services, mobile apps, products, and content (“Services”). By using Medium, you agree to these Terms. If you don’t agree to any of the Terms, you can’t use Medium. We can change these Terms at any time. We keep a historical record of all changes to our Terms on GitHub. If a change is material, we’ll let you know before they take effect. By using Medium on or after that effective date, you agree to the new Terms. If you don’t agree to them, you should delete your account before they take effect, otherwise your use of the site and content will be subject to the new Terms.',
  'copyright'],
 ['You own the rights to the content you create and post on Medium.',
  'copyright'],
 ['By posting content to Medium, you give us a nonexclusive license to publish it on Medium Services, including anything reasonably related to publishing it (like storing, displaying, reform