


# Document Similarity with Latent Semantic Analysis (LSA)

The following notebook walks you through doing LSA document similarity in Python. We then output the document similarity matrix as a .csv file which can be manipulated to highlight similarity between documents. You then have the option of using our "docSimLSAHeatmap" notebook to create a heatmap of cosine similarity scores between documents.

###  Before we begin
Before we start, you will need to have set up a [Carbonate account](https://kb.iu.edu/d/aolp) in order to access [Research Desktop (ReD)](https://kb.iu.edu/d/apum). You will also need to have access to ReD through the [thinlinc client](https://kb.iu.edu/d/aput). If you have not done any of this, or have only done some of this, but not all, you should go to our [textPrep-Py.ipynb](https://github.com/cyberdh/Text-Analysis/blob/drafts/textPrep-Py.ipynb) before you proceed further. The textPrep-Py notebook provides information and resources on how to get a Carbonate account, how to set up ReD, and how to get started using the Jupyter Notebook on ReD.   

### Run CyberDH environment
The code in the cell below points to a Python environment specifically for use with the Python Jupyter Notebooks created by Cyberinfrastructure for Digital Humanities. It allows for the use of the different packages in our notebooks and their subsequent data sets.

##### Packages
- **sys:** Provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It is always available.
- **os:** Provides a portable way of using operating system dependent functionality.

#### NOTE: This cell is only for use with Research Desktop. You will get an error if you try to run this cell on your personal device!!

In [1]:
import sys
import os
sys.path.insert(0,"/N/u/cyberdh/Carbonate/dhPyEnviron/lib/python3.6/site-packages")
os.environ["NLTK_DATA"] = "/N/u/cyberdh/Carbonate/dhPyEnviron/nltk_data"

### Include necessary packages for notebook 

Python's extensibility comes in large part from packages. Packages are groups of functions, data, and algorithms that allow users to easily carry out processes without reinventing the wheel. Some packages are included in the basic installation of Python; others, created by Python users, are available for download. Make sure to have the following packages installed before beginning so that they can be accessed while running the scripts.

In your terminal, packages can be installed by simply typing `pip install nameofpackage --user`. However, since you are using ReD and our Python environment, you will not need to install any of the packages below to use this notebook. Anytime you need to make use of a package, however, you need to import it so that Python knows to look in these packages for any functions or commands you use. Below is a brief description of the packages we are using in this notebook:  

- **sklearn:** Simple and efficient tools for data mining and data analysis built on NumPy, SciPy, and matplotlib.
- **pandas:** An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- **warnings:** Allows for the manipulation of warning messages in Python.
- **numpy:** A general-purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays.
- **string:** Contains a number of functions to process standard Python strings.
- **nltk:** A leading platform for building Python programs to work with human language data.
- **spacy:** A library for advanced Natural Language Processing in Python and Cython.

Notice we import some of the packages differently. In some cases we just import the entire package when we say `import XYZ`. For some packages which are small, or, from which we are going to use a lot of the functionality it provides, this is fine. 

Sometimes when we import the package directly we say `import XYZ as X`. All this does is allow us to type `X` instead of `XYZ` when we use certain functions from the package. So we can now say `X.function()` instead of `XYZ.function()`. This saves time typing and eliminates errors from having to type out longer package names. We could just as easily type `import XYZ as potato`, and whenever we use a function from the `XYZ` package we would need to type `potato.function()`. What you import the package as is up to you, but some commonly used packages have abbreviations that are standard amongst Python users, such as `import pandas as pd` or `import matplotlib.pyplot as plt`. You do not need to use `pd` or `plt`; however, these are widely used, and using something else could confuse other users and is generally considered bad practice.

Other times we import only specific elements or functions from a package. This is common with packages that are very large and provide a lot of functionality, but from which we are only using a couple functions or a specific subset of the package that contains the functionality we need. This is seen when we say `from XYZ import ABC`. This is saying I only want the `ABC` function from the `XYZ` package. Sometimes we need to point to the specific location where a function is located within the package. We do this by adding periods in between the directory names, so it would look like `from XYZ.123.A1B2 import LMN`. This says we want the `LMN` function which is located in the `XYZ` package and then the `123` and `A1B2` directory in that package. 

You can also import more than one function from a package by separating the functions with commas like this `from XYZ import ABC, LMN, QRS`. This imports the `ABC`, `LMN` and `QRS` functions from the `XYZ` package.
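
As a quick, self-contained illustration of the three import styles described above, the sketch below uses only modules from Python's standard library. The modules `math`, `collections`, and `os.path` are stand-ins chosen for this example, and the file path is made up.

```python
# Style 1: import the whole package and use its full name.
import math

# Style 2: import the package under a shorter alias.
import collections as col

# Style 3: import only specific functions from a sub-module.
from os.path import basename, splitext

wholePackage = math.sqrt(16)          # called as package.function()
aliased = col.Counter("banana")       # called as alias.function()
stem, ext = splitext(basename("/data/1599JuliusCaesar.txt"))

print(wholePackage, aliased["a"], stem, ext)
```

Whichever style is used, the function being called is the same; only the name you type in front of it changes.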

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import warnings
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
import spacy

This will ignore specific deprecation, future, and user warnings raised by pandas and sklearn. None of the warnings in this code are concerning, and they will not break the code or cause errors in the results.

In [3]:
# Suppress selected warnings from the pandas and sklearn libraries
warnings.filterwarnings("ignore", category=DeprecationWarning,
                        module="pandas", lineno=570)
warnings.filterwarnings("ignore", category=FutureWarning,
                        module = "sklearn", lineno = 1059)
warnings.filterwarnings("ignore", category=UserWarning,
                        module = "sklearn", lineno = 300)
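
To see what `filterwarnings` does in isolation, here is a small sketch using only the standard library. The `noisyFunction` helper is hypothetical and exists only to trigger a warning; `catch_warnings` is used so that the filter does not leak outside the example.

```python
import warnings

def noisyFunction():
    # Hypothetical helper that emits a DeprecationWarning when called.
    warnings.warn("this call is deprecated", DeprecationWarning)
    return "result"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # start by recording every warning
    noisyFunction()
    countBefore = len(caught)

    # Now ignore DeprecationWarning, as the notebook does for pandas.
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    noisyFunction()
    countAfter = len(caught)

print(countBefore, countAfter)
```

The first call is recorded, while the second call adds nothing to the list because the "ignore" filter is checked first.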

### Getting your data

#### File paths
Here we are saving as variables different file paths that we need in our code. We do this so that they are easier to call later and so that you can make most of your changes now and not need to make as many changes later. 

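First we use the `os` package above to find our home directory by looking up `"HOME"` in `os.environ`, a dictionary-like object that holds the environment variables. `HOME` is set on ReD and on most Linux and macOS systems, so if you decide to try this out on such a personal computer instead of ReD, the `homePath` variable will still be the path to your home directory and no changes are needed. On Windows `HOME` is often not defined, in which case this line raises a `KeyError` and you will need to supply the path another way.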

Next, we combine the `homePath` variable with the folder names that lead to where our data is stored. Note that we do not use any file names yet, just the path to the folder. This is because we are comparing documents to one another, so we need to read in an entire directory. You will want to change the folder names to match your folder names in your file path.

Now we add the `homePath` variable to other folder names that lead to a folder where we will want to save our document similarity matrix. You again will want to change the folder names in the path to match your own folder names. We save this file path as the variable `cleanedData`.

In [4]:
homePath = os.environ["HOME"]
dataHome = os.path.join(homePath, "Text-Analysis-master", "data", "shakespeareDated")
cleanedData = os.path.join(homePath, "Text-Analysis-master", "TopicModeling", "LSA", "cleanedData")
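
If you run the notebook somewhere that `HOME` is not defined, one hedged alternative is `os.path.expanduser("~")`, which Python resolves on Windows, macOS, and Linux. The folder names below are the ones from the cell above; the variable name `portableHome` is introduced for this sketch only.

```python
import os

# Fall back to expanduser when the HOME environment variable is absent.
portableHome = os.environ.get("HOME") or os.path.expanduser("~")

# os.path.join inserts the correct separator for the current operating system.
dataHome = os.path.join(portableHome, "Text-Analysis-master", "data", "shakespeareDated")

print(dataHome)
print(os.path.isdir(dataHome))  # True only if the folder actually exists
```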

#### Set needed variables

Now we assign values to variables that will inform various parts of our code. Just like the file path variables, this is done so you have to make fewer changes later and also to make the changes easier to find by putting them in one place.

- **nltkStop:** If you want to use the stopword list that comes with the nltk package then set `nltkStop` equal to `True`. If you do not wish to use the nltk stopword list then set `nltkStop` equal to `False`.

- **customStop:** If you have created your own custom stopword list and wish to use that, then set `customStop` equal to `True`. If you do not have your own custom stopword list then set `customStop` equal to `False`.

**NOTE: You can use both the nltk and custom stopword lists or you can use neither or just one or the other. You do NOT need to set them both to True or both to False. Use whatever works best for you.**

- **lem:** Next we decide if we want to lemmatize our words. Lemmatizing words will turn certain words to the root of the word. So "are" and "is" become "be" and "runs" and "running" become "run". This will probably increase the similarity of documents as they will then share more words in common. If you want to lemmatize the words in your dataset then assign `True` to the variable `lem`. If you do not wish to lemmatize your words then assign `False` to the variable `lem`.

- **lowerCase:** Then we decide if we want all the words in our dataset lowercased. This will change "Love" to "love" so that it is recognized as the same word for similarity purposes. However, there are some cases where the use of capitalization may be important to determining similarity, so we have the option to lowercase or not. If you want to lowercase all the words in your dataset assign `True` to the variable `lowerCase`. If you do not wish to lowercase all the words in your dataset then assign `False` to the variable `lowerCase`.

- **removeDigits:** Now we decide if we want to remove numbers from our text. Again, removing numbers will increase the similarity of texts as page numbers and other integers that may not be exactly alike will be removed. However, there are instances where numbers are thematically important, and they need to be kept. Here is where you make that decision. If you wish to remove all numbers then assign `True` to the `removeDigits` variable. If you wish to retain all numbers then assign `False` to the `removeDigits` variable.

- **language:** Now we choose the language we will be using for the nltk stopwords list. If you need a different language, simply change 'english' (keep the quotes) in the `language` variable to the anglicized name of the language you wish to use (e.g. 'spanish' instead of 'espanol' or 'german' instead of 'deutsch').

- **lemLang:** Now we choose the language for our lemmatizer. The languages available for spacy include the list below and the abbreviation spacy uses for that language. To choose a language simply type the two letter code following the anglicized language name in the list. So for Spanish it would be `'es'` (with the quotes) and for German `'de'` and so on.

- **English:** `'en'`
- **Spanish:** `'es'`
- **German:** `'de'`
- **French:** `'fr'`
- **Italian:** `'it'`
- **Portuguese:** `'pt'`
- **Dutch:** `'nl'`
- **Multi-Language:** `'xx'`

- **encoding/errors:** The variable `encoding` is where you determine what type of encoding to use (ascii, ISO-8859-1, utf-8, etc.). We have it set to utf-8 at the moment as we have found it is less likely to have any problems. Encoding errors do still occur, however, and although they rarely impact our results, they cause the Python code to exit. So instead of dealing with unhelpful errors we ignore the ones dealing with encoding by assigning `'ignore'` to the `errors` variable. If you want to see any encoding errors then change `'ignore'` to `None` without the quotes.

- **singleDocs:** If your data exists as a single file for each document and is in one directory, then assign `True` to the `singleDocs` variable. If each of your "documents" is actually multiple directories of multiple files and each directory needs to be considered as a separate "document", then assign `False` to the `singleDocs` variable.

- **nComp:** The LSA algorithm used by the sklearn Python package has a required parameter called `n_components` for "number of components" (the purpose of this parameter will be explained later). There is a part of the code further down in this notebook that will help determine what the ideal number of components is for your corpus to produce the most accurate result. This will also make the process take longer as it is an added complicated step. If you want to perform this part of the code and try to determine the best number of components, then assign `True` to the `nComp` variable. If you do not wish to perform this added step then assign `False` to the `nComp` variable. The default number of components will be the number of "documents" in your corpus if you assign `False` to `nComp`.

- **stopWords = []:** The `stopWords = []` variable is simply an empty list. This is where the words from the nltk stopword list or your custom stopword list or both combined or neither (depending on what you decide) will reside later on. You do not need to do anything to this line of code.

- **tokenDict = {}:** The `tokenDict = {}` variable is an empty dictionary. This is where your documents will reside later. The file or folder name (depending on your choices above) for the document will be the key and the content of the document will be the value. This will be explained in more detail later. For now, you do not need to do anything to this line.

In [5]:
nltkStop = True
customStop = True
lem = True
lowerCase = True
removeDigits = True
language = 'english'
lemLang = "en"
encoding = 'utf-8'
errors = 'ignore'
singleDocs = True
nComp = True
stopWords = []
tokenDict = {}

### Stopwords
If you set `nltkStop` equal to **True** above then this will add the NLTK stopwords list to the empty list named `stopWords`.

You should have already chosen your desired language above, but if you wish to add any words to the stopWords list then add the word(s) you want as a stop word in the `stopWords.extend(['words', 'you', 'want', 'to', 'add'])` part of the code.

In [6]:
if nltkStop is True:
    # NLTK Stop words
    stopWords = stopwords.words(language)

    stopWords.extend(['would', 'said', 'says', 'also'])

#### Add own stopword list

Here is where your own stopwords list is added if you set `customStop` equal to **True** above. Here you will need to change the folder names and file name to match your folders and file. Remember to put each folder name in quotes and in the correct order always putting the file name including the file extension (.txt) last.

In [7]:
if customStop is True:
    stopWordsFilepath = os.path.join(homePath, "Text-Analysis-master", "data", "earlyModernStopword.txt")

    with open(stopWordsFilepath, "r",encoding = encoding, errors = errors) as f:
        stopWordsList = [x.strip() for x in f.readlines()]

    stopWords.extend(stopWordsList)
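
Because the NLTK list and a custom list can overlap, the combined `stopWords` list may contain duplicates. That is harmless, but if you want a tidy list, one possible approach is sketched below. The two short lists are made-up stand-ins for the real stopword sources, and `dedupeStopWords` is a hypothetical helper, not part of the notebook.

```python
def dedupeStopWords(words):
    """Lowercase, strip, and de-duplicate while keeping first-seen order."""
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip().lower()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned

nltkLike = ["the", "and", "would", "said"]     # stand-in for the NLTK list
customLike = ["thou", "thee", "The ", "", "said"]  # stand-in for a custom file
combined = dedupeStopWords(nltkLike + customLike)
print(combined)
```

Note that this sketch lowercases every stopword, which only makes sense if you also set `lowerCase` to `True`.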

### Functions
We need to create a function in order to lemmatize and tokenize our data. Any time you see `def` that means we are **DEF**ining a function. The `def` is followed by the name of the function being created and then in parentheses are the parameters required by the function. After the parentheses is a colon, which closes the definition line, then a bunch of code below which is indented. The indented code is the program statement or statements to be executed. Once you have created your function all you need to do in order to run it is call the function by name and make sure you have included all the required parameters in the parentheses. This allows you to call the function without having to write out all the code in the function every time you wish to perform that task.

#### Tokenization functions
Here we have an "if...else" statement. If we set `lem` to `True` above, we load the spaCy language model named by the `lemLang` variable and create two functions: one to ignore whitespace tokens such as the "\n" markers in our text, and the next to tokenize and lemmatize our text. Both of these functions rely on that language model to lemmatize the text, which is why the language must be chosen before this cell is run. The call also asks spaCy to skip its parser and named-entity components, and `nlp.max_length` is raised so that long documents can be processed.

However, if we assigned `False` to `lem` above then we only create one function that tokenizes our text, no need to specify a language.

You should not need to make any changes to this block of code.

In [8]:
if lem is True:
    nlp = spacy.load(lemLang, disable=["parser","ner"])
    nlp.max_length=2500000
    def tokenFilter(token):
        return not (token.is_space)
    
    def tokenize(text):
        for doc in nlp.pipe([text]):
            tokens = [token.lemma_ for token in doc if tokenFilter(token)]
        return tokens

else:
    def tokenize(text):
        tokens = nltk.word_tokenize(text)
        return tokens

#### Read in documents
Now we read in our documents and also perform some text cleaning. This code always removes punctuation, and it lowercases the words and removes digits depending on what you assigned to the `lowerCase` and `removeDigits` variables above. Then it adds the file names and cleaned content of each file to our previously empty `tokenDict` dictionary. You should not need to make any changes to this code.

A dictionary is similar to a list except it has what are called 'keys' and 'values'. This basically allows us to label our data. In this case we will be making the file or folder names of our documents the 'keys' and the content of the file(s) the 'values' so that each document name correlates to the content of that document.

The `if singleDocs is True:` statement says that if we assigned `True` to the `singleDocs` variable above, then the code below the statement will be run, which reads in files as if each file is a single document. If we did NOT assign `True` to `singleDocs` (`else`), then the code below `else` will be run instead, which reads in the files for each directory and treats each directory as a single document.

In [9]:
if singleDocs is True:
    for subdir, dirs, files in os.walk(dataHome):
        for file in files:
            if file.startswith('.'):
                continue
            filePath = subdir + os.path.sep + file
            with open(filePath, 'r', encoding = encoding, errors = errors) as textFile:
                text = textFile.read()
                if lowerCase and removeDigits is True:
                    lowers = text.lower()
                    noPunctuation = lowers.translate(str.maketrans('','', string.punctuation))
                    noDigits = noPunctuation.translate(str.maketrans('','', string.digits))
                    tokenDict[file] = noDigits
                elif lowerCase == True and removeDigits == False:
                    lowers = text.lower()
                    noPunctuation = lowers.translate(str.maketrans('','', string.punctuation))
                    tokenDict[file] = noPunctuation
                elif lowerCase == False and removeDigits == True:
                    noPunctuation = text.translate(str.maketrans('','', string.punctuation))
                    noDigits = noPunctuation.translate(str.maketrans('','', string.digits))
                    tokenDict[file] = noDigits
                else:
                    noPunctuation = text.translate(str.maketrans('','', string.punctuation))
                    tokenDict[file] = noPunctuation
else:
    data = []
    text = []
    for folder in sorted(os.listdir(dataHome)):
        if not os.path.isdir(os.path.join(dataHome, folder)):
            continue
        for file in sorted(os.listdir(os.path.join(dataHome, folder))):
            data.append(((dataHome,folder,file)))
    df = pd.DataFrame(data, columns = ["Root", "Folder", "File"])
    df["Paths"] = df["Root"].astype(str) + "/" + df["Folder"].astype(str) + "/" + df["File"].astype(str)
    for path in df["Paths"]:
        if not path.endswith(".txt"):
            continue
        with open(path, "r", encoding=encoding, errors = errors) as f:
            t = f.read().strip().split()
            if lowerCase and removeDigits is True:
                lowers = ' '.join(t).lower()
                noPunctuation = lowers.translate(str.maketrans('','', string.punctuation))
                noDigits = noPunctuation.translate(str.maketrans('','', string.digits))
                text.append(noDigits)
            elif lowerCase == True and removeDigits == False:
                lowers = ' '.join(t).lower()
                noPunctuation = lowers.translate(str.maketrans('','', string.punctuation))
                text.append(noPunctuation)
            elif lowerCase == False and removeDigits == True:
                noPunctuation = ' '.join(t).translate(str.maketrans('','', string.punctuation))
                noDigits = noPunctuation.translate(str.maketrans('','', string.digits))
                text.append(noDigits)
            else:
                noPunctuation = ' '.join(t).translate(str.maketrans('','', string.punctuation))
                text.append(noPunctuation)
    df["Text"] = pd.Series(text)
    df["Text"] = ["".join(map(str, l)) for l in df["Text"].astype(str)]
    d = {'Text':'merge'}
    dfText = df.groupby(["Folder"])["Text"].apply(lambda x: ' '.join(x)).reset_index()
    
    tokenDict = dict(zip(dfText["Folder"], dfText["Text"]))
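
The four cleaning branches above differ only in whether lowercasing and digit removal are applied, so the same logic can be written as one helper. This is a sketch of an alternative, not the notebook's code; `cleanText`, `sampleDict`, and the sample strings are hypothetical.

```python
import string

def cleanText(text, lowerCase=True, removeDigits=True):
    """Strip punctuation, and optionally lowercase and strip digits."""
    if lowerCase:
        text = text.lower()
    # Build one translation table that deletes every unwanted character.
    unwanted = string.punctuation + (string.digits if removeDigits else "")
    return text.translate(str.maketrans("", "", unwanted))

# File names become keys and cleaned text becomes values, as in tokenDict.
sampleDict = {}
sampleDict["scene.txt"] = cleanText("Act 1, Scene 2!")
sampleDict["raw.txt"] = cleanText("Act 1, Scene 2!", lowerCase=False, removeDigits=False)
print(sampleDict)
```

Deleting characters leaves the surrounding spaces in place, so removed digits can leave double spaces behind; the tokenizer treats those as ordinary whitespace.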

Let's check and see if our dictionary now has our data. We are asking to see the first 10 keys of our dictionary.

In [10]:
print(list(tokenDict.keys())[:10])

['1610Cymbeline.txt', '1596MerchantOfVenice.txt', '1604AllsWellThatEndsWell.txt', '1591KingHenry6_3.txt', '1599JuliusCaesar.txt', '1600TroilusAndCressida.txt', '1605TimonOfAthens.txt', '1592KingRichard3.txt', '1591KingHenry6_2.txt', '1603MeasureForMeasure.txt']


#### Tfidf Vectorizer

Here we weight the importance of each word in the document. This is done using Term Frequency-Inverse Document Frequency (Tfidf). This considers how important a word is based on the frequency in the whole corpus as well as in individual documents. This allows for words that might not have a high frequency in an entire collection, but do have a high frequency in one or two documents when compared to other words to still be given a higher level of importance throughout the text.

In [11]:
vectorizer = TfidfVectorizer(tokenizer = tokenize, stop_words = stopWords)
dtm = vectorizer.fit_transform(tokenDict.values())
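
To make the weighting concrete, the sketch below computes Tfidf scores by hand for a three-document toy corpus using only the standard library. It assumes the smoothed formula given in the scikit-learn documentation, idf = ln((1 + n) / (1 + df)) + 1, and it skips the length normalization that `TfidfVectorizer` applies afterwards, so the numbers will not match the notebook's matrix exactly. The corpus and the `tfidfScores` name are made up.

```python
import math
from collections import Counter

toyCorpus = {
    "docA": "love love crown",
    "docB": "crown sword",
    "docC": "crown ghost ghost ghost",
}

def tfidfScores(corpus):
    n = len(corpus)
    counts = {name: Counter(text.split()) for name, text in corpus.items()}
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for c in counts.values() for word in c)
    scores = {}
    for name, c in counts.items():
        scores[name] = {
            word: tf * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, tf in c.items()
        }
    return scores

scores = tfidfScores(toyCorpus)
for name, weights in scores.items():
    print(name, {w: round(v, 3) for w, v in weights.items()})
```

In this toy corpus "crown" appears in every document and so receives the lowest weight, while "ghost", which is concentrated in a single document, receives the highest.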

The code below outputs the first 20 words that make up the columns once we have broken our corpus down into a Tfidf matrix. To output all of the words remove the square brackets and their contents in the `print(vectGFN[:20])` line of code. To change the number of words printed change `20` in the same line to the number of words you wish to see.

In [12]:
# Get words that correspond to each column
vectGFN = vectorizer.get_feature_names()
print(vectGFN[:20])

['-PRON-', 'aaron', 'abandon', 'abase', 'abash', 'abate', 'abated', 'abatement', 'abatfowling', 'abbess', 'abbey', 'abbot', 'abbreviate', 'abc', 'abe', 'abed', 'abel', 'abergavenny', 'abet', 'abhor']


If we assigned `True` to the `nComp` variable above, this cell fits a `TruncatedSVD` model on the tfidf matrix and creates a function that finds the smallest number of components whose cumulative explained variance ratio reaches a threshold of our choice (currently 0.95 in the code cell directly below this one). The math is complicated, but the idea is that we are reducing our tfidf matrix to a more manageable size, and we need to know how few components we can reduce it to while still retaining a certain share of the variance in our data (again, currently 0.95, or 95%). If we assigned `False` to the `nComp` variable, this cell will be skipped.

In [13]:
if nComp is True:
    tsvd = TruncatedSVD(n_components=dtm.shape[1]-1)
    tsvd.fit(dtm)
    tsvdVarRatios = tsvd.explained_variance_ratio_

    def selectNcomponents(var_ratio, goal_var: float) -> int:
        total_variance = 0.0
        n_components = 0
        for explained_variance in var_ratio:
            total_variance += explained_variance
            n_components += 1
            if total_variance >= goal_var:
                break
        return n_components
else:
    None

If we assigned `True` to the `nComp` variable we will now apply the function we created in the cell above. This will incrementally increase the number of components we will reduce our matrix to until the cumulative explained variance ratio reaches 95% or higher, meaning we are still keeping 95% or more of the variance in our data. If we assigned `False` to `nComp` then this cell will be skipped.

In [14]:
if nComp is True:
    nc = selectNcomponents(tsvdVarRatios, 0.95)
    print(nc)
else:
    None

31
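
The selection logic is easier to follow on a small made-up list of variance ratios. The sketch below is an equivalent formulation using `itertools.accumulate` from the standard library; the ratios are invented for illustration and `componentsForVariance` is a hypothetical name.

```python
from itertools import accumulate

def componentsForVariance(varRatios, goalVar):
    """Return how many leading components are needed to reach goalVar."""
    for count, runningTotal in enumerate(accumulate(varRatios), start=1):
        if runningTotal >= goalVar:
            return count
    # Goal never reached: fall back to using every component.
    return len(varRatios)

toyRatios = [0.50, 0.30, 0.10, 0.06, 0.04]
toyResult = componentsForVariance(toyRatios, 0.95)
print(toyResult)
```

With these toy ratios the running total first passes 0.95 at the fourth component, so four components would be kept.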


#### Run SVD and Cosine Similarity

Here we run our Tfidf matrix created above through Singular Value Decomposition and then calculate the Cosine Similarity of the documents to one another. 

Singular Value Decomposition condenses our Tfidf matrix down a bit to make it easier to process. Here we also set the number of dimensions (`n_components`), the number of iterations the randomized SVD solver runs (`n_iter`), and the seed for the random number generator (`random_state`) so that the results are reproducible. At the moment `random_state` is set to 42, but feel free to adjust the number to get a slightly different output. Just make sure you keep the seed the same once you find one you like for reproducibility.

Cosine similarity is where we measure how similar the documents are to one another. The result is a number between -1 and 1, with 1 being a perfect match (which we will get when a document is compared to itself), 0 meaning the two document vectors have nothing in common, and -1 meaning they point in exactly opposite directions. Raw Tfidf weights are never negative, so negative scores can only appear because SVD produces components with both positive and negative values. Usually, even documents that are about unrelated topics share some common words and so are not completely dissimilar.

In [15]:
if nComp is True:
    lsa = TruncatedSVD(n_components = nc, n_iter = 1000, random_state = 42)
    dtmLsa = lsa.fit_transform(dtm)
    cosineSim = cosine_similarity(dtmLsa)
else:
    lsa = TruncatedSVD(n_components = dtm.shape[0], n_iter = 1000, random_state = 42)
    dtmLsa = lsa.fit_transform(dtm)
    cosineSim = cosine_similarity(dtmLsa)
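
For intuition about the scores, here is a minimal hand-rolled version of cosine similarity using only the standard library. It is a sketch for two small vectors, not a replacement for sklearn's `cosine_similarity`; the `cosine` function and the vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    normU = math.sqrt(sum(a * a for a in u))
    normV = math.sqrt(sum(b * b for b in v))
    return dot / (normU * normV)

same = cosine([1.0, 2.0], [1.0, 2.0])         # same direction
unrelated = cosine([1.0, 0.0], [0.0, 3.0])    # no shared dimension
opposite = cosine([1.0, 2.0], [-1.0, -2.0])   # opposite direction
print(round(same, 6), round(unrelated, 6), round(opposite, 6))
```

Only the direction of each vector matters, not its length, which is why a long play and a short one can still score as highly similar.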

#### Save as .csv

Now we save the results as a .csv file. First we name the output .csv file so it matches our data. We do this in the first line of the cell.

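Then we create the data frame and say we want the rows and columns to be labeled with the file names. Then we sort the columns in alphanumeric order by column header, then we sort the rows alphanumerically by row label. We also replace the diagonal, where each document is compared with itself, with `NaN` so that those trivial self-matches are left blank in the output.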

Finally, we export the dataframe as a .csv file.

You can manipulate the .csv file in Excel or some other spreadsheet software, or you can use it in the "docSimLSAHeatmap" notebook that is part of this text analysis repository.

In [16]:
csvFileName = "docSimilarityMatrix.csv"

df = pd.DataFrame(cosineSim, index = tokenDict.keys(), columns=tokenDict.keys())
dfS = df[sorted(df)]
sortedDf = dfS.sort_index(axis = 0)
np.fill_diagonal(sortedDf.values, np.nan)
sortedDf.to_csv(os.path.join(cleanedData, csvFileName))
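
As one example of manipulating the saved matrix, the sketch below finds each document's closest match using only the standard library. It reads a small in-memory stand-in rather than the real file, and it assumes the `NaN` diagonal is written as an empty cell, which is the pandas default. The file names, scores, and the `mostSimilar` helper are all made up.

```python
import csv
import io

# Toy stand-in for docSimilarityMatrix.csv; blank cells are the NaN diagonal.
toyCsv = io.StringIO(
    ",hamlet.txt,macbeth.txt,tempest.txt\n"
    "hamlet.txt,,0.82,0.35\n"
    "macbeth.txt,0.82,,0.41\n"
    "tempest.txt,0.35,0.41,\n"
)

def mostSimilar(csvFile):
    """Map each row label to its highest-scoring other document."""
    reader = csv.reader(csvFile)
    header = next(reader)[1:]   # column labels, skipping the corner cell
    best = {}
    for row in reader:
        label, cells = row[0], row[1:]
        scored = [(float(c), name) for c, name in zip(cells, header) if c != ""]
        best[label] = max(scored)[1]
    return best

closest = mostSimilar(toyCsv)
print(closest)
```

To use it on the real output, you would pass an opened file object for the saved .csv in place of `toyCsv`.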

This notebook was adapted from https://www.datascienceassn.org/sites/default/files/users/user1/lsa_presentation_final.pdf at Colorado University, Boulder. Accessed on 02/01/2019.