# Whose Fake Is It Anyway? HCVIS Proseminar Final Project
---
Davienne Gabriel<br>
November 23rd, 2020

---

### Introduction

> This Jupyter Notebook will serve as the process component of the final paper. At each step I will explain the code and what it does. I will also provide some commentary on the choices I made along the way, and how those ultimately shaped the outcome of this exploratory project.

### Step 1: Import the Modules and Corpora

---

> The first code block imports the modules the notebook needs, or 'gets everything ready', so that they can be called later in the document. The libraries I am importing include the Natural Language Toolkit (NLTK), along with some data science libraries (numpy, pandas, and datascience). Bokeh is used to visualize the data at the end of the process.

> There is a wide variety of Python libraries that overlap in functionality. I could have chosen other modules to work with, which could very well have led to a different outcome in this project. That is not to say that these libraries differ greatly in the work they do, but that there are myriad paths to an end result (whatever that may be). I chose to work with NLTK, numpy, pandas, and Bokeh because I am familiar with them. Something to consider in the future is incorporating other libraries or tools to explore the data, as doing so may yield further insight.

In [1]:
import nltk
from nltk import FreqDist

import numpy as np
import pandas as pd
from datascience import *

from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FixedTicker, PrintfTickFormatter
from bokeh.plotting import figure
from bokeh.embed import file_html
from bokeh.resources import CDN

from bokeh.palettes import Spectral9

> This second code block loads the corpora into the notebook. In order for these texts to be "read" by NLTK, they need to be parsed into tokens. This process is done for each artist's corpus, resulting in six distinct collections of tokens.

> Initially, I wanted to use academic texts for this project. After trying to download and parse some PDFs relating to Han van Meegeren, I found that the OCR tools I had on hand could not reliably read the PDFs. I decided to use text that I could access online: specifically, text from the first couple of pages of Google search results for each person. This method is problematic from a few standpoints, but I wanted to see (setting aside the algorithmic biases that would be present in the text) whether any pattern could be discerned from the assembled text. As such, the following texts were gathered from the initial results of a Google search, which usually included biographical information, news articles, and a few exhibition summaries.

> It was apparent when adding text for Han van Meegeren and Elmyr de Hory that the sources focused specifically on their forgeries, whereas the texts for the other four artists were broader in what they covered. As a result, certain words are overrepresented for Van Meegeren and De Hory compared to the other four artists. Future exploration in this area with varied sources (especially academic versus news sources) would be rich. Do certain words show up more in lay publications than in academic ones?

In [2]:
# Read a text file and tokenize it. Raw strings (r"...") keep the Windows
# backslashes literal, and the `with` block closes the file after reading.

def load_tokens(path):
    with open(path, encoding="utf8") as text_file:
        return nltk.word_tokenize(text_file.read())

# One list of tokens per artist

Meegeren = load_tokens(r"C:\Users\j'aam\Desktop\school\HCVIS final\meegren\meegren.txt")
DeHory = load_tokens(r"C:\Users\j'aam\Desktop\school\HCVIS final\dehory.txt")
Koons = load_tokens(r"C:\Users\j'aam\Desktop\school\HCVIS final\koons.txt")
Prince = load_tokens(r"C:\Users\j'aam\Desktop\school\HCVIS final\prince.txt")
Lawler = load_tokens(r"C:\Users\j'aam\Desktop\school\HCVIS final\lawler.txt")
Sturtevant = load_tokens(r"C:\Users\j'aam\Desktop\school\HCVIS final\sturtevant.txt")

In [3]:
# Example of what the collected text looks like when it is tokenized

Meegeren

['Forgers',
 ',',
 'by',
 'nature',
 ',',
 'prefer',
 'anonymity',
 'and',
 'therefore',
 'are',
 'rarely',
 'remembered',
 '.',
 'An',
 'exception',
 'is',
 'Han',
 'van',
 'Meegeren',
 '(',
 '1889–1947',
 ')',
 '.',
 'Van',
 'Meegeren',
 "'s",
 'story',
 'is',
 'absolutely',
 'unique',
 'and',
 'may',
 'be',
 'justly',
 'considered',
 'the',
 'most',
 'dramatic',
 'art',
 'scam',
 'of',
 'the',
 'twentieth',
 'century',
 '.',
 'In',
 '1937',
 ',',
 'Abraham',
 'Bredius',
 ',',
 'who',
 'as',
 'one',
 'of',
 'the',
 'most',
 'authoritative',
 'art',
 'historians',
 'had',
 'dedicated',
 'a',
 'great',
 'part',
 'of',
 'his',
 'life',
 'to',
 'the',
 'study',
 'of',
 'Vermeer',
 ',',
 'was',
 'approached',
 'by',
 'a',
 'lawyer',
 'who',
 'claimed',
 'to',
 'be',
 'the',
 'trustee',
 'of',
 'a',
 'Dutch',
 'family',
 'estate',
 'in',
 'order',
 'to',
 'have',
 'him',
 'look',
 'at',
 'a',
 'rather',
 'large',
 'painting',
 'of',
 'a',
 'Christ',
 'and',
 'the',
 'Disciples',
 'at',
 'E

### Step 2: Define Terms, Create a Function

---

> This code block defines the collection of terms whose frequency I wanted to see in each text. As I have outlined in the thesis of this project, all of these words are similar, but they certainly do not mean the same thing. In fact, they have very specific meanings as they relate to conceptions of crime and artistry. I wanted to see how the occurrence of these words in texts about the artists reflected perceptions of each artist and the objective term for their particular artwork. Are the authors of these texts (whoever they are) using synonyms to fluff up their work, or are deliberate choices being made in how the work is described?

> These words were chosen at random and were not selected based on any preconceived notions or analytics pointing to their common use in the texts. To my surprise, only one word ("knockoff") did not occur at all; every other word appeared at least once within the corpora. Further study could examine the word choice more closely, along with more robust processing to identify the sentiment attached to these words and the artists.

In [6]:
copy_terms = "forger", "forgery", "duplicate", "copy", "fraud", "fake", "appropriation", "appropriate", "replica", "replication", "reproduction", "facsimile", "imitation", "counterfeit", "knockoff", "dupe", "mimic", "plagiarism"

> This code block defines a function that I used to find the frequency of these words within each text. For each tokenized text passed in, the function looks up the frequency of every term in copy_terms and returns a table with this information.

In [7]:
def make_word_freq_table(tokenlist):

    # Create a frequency distribution for the tokenized text

    freqDist = FreqDist(tokenlist)

    # Create an empty array to hold the count for each term

    totals = make_array()

    # Look up the frequency of each term in copy_terms (exact, case-sensitive match)

    for term in copy_terms:
        totals = np.append(totals, freqDist[term])

    # Create a table with the words and their corresponding frequencies

    frequency = Table().with_columns(
        "Word", copy_terms,
        "Frequency", totals
        )

    # Output table

    return frequency

In [8]:
# An example of the function with the Meegeren tokenized text passed through it

make_word_freq_table(Meegeren).show()

| Word | Frequency |
| --- | --- |
| forger | 35 |
| forgery | 51 |
| duplicate | 0 |
| copy | 4 |
| fraud | 9 |
| fake | 21 |
| appropriation | 0 |
| appropriate | 1 |
| replica | 0 |
| replication | 1 |
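
> One caveat worth illustrating: `FreqDist` counts exact tokens, so a capitalized or plural form such as 'Forgers' (the first token in the Meegeren output above) is not counted under "forger". The following is a minimal sketch, not part of the original analysis, of a case-insensitive count. It uses `collections.Counter` from the standard library as a stand-in for `FreqDist` (which subclasses it). The function name `count_terms_casefolded` and the toy token list are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical helper: count each term after lowercasing every token.
def count_terms_casefolded(tokens, terms):
    counts = Counter(token.lower() for token in tokens)
    # Counter returns 0 for missing keys, so absent terms need no special case.
    return {term: counts[term] for term in terms}

# Toy tokens for illustration only (not drawn from the project's results).
toy_tokens = ["Forger", "forger", "Fake", "forgery", ",", "Forgers"]
toy_counts = count_terms_casefolded(
    toy_tokens, ("forger", "forgery", "fake", "knockoff")
)

# Terms that never occur, which is how a word like "knockoff" would surface.
missing = [term for term, n in toy_counts.items() if n == 0]
print(toy_counts)
print(missing)
```

> Plurals such as "forgers" still go uncounted in this sketch. Folding them in would need stemming or lemmatization, which is a separate design choice.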


### Step 3: Convert Table into a Dataframe for Visualization

> This code block converts the table into a dataframe that can be more easily read and manipulated by the visualization tool. I could have used pandas dataframes from the start of this project, but it was only when I began working with the visualization library that I realized I needed the frequencies in a dataframe. Converting the data midway through the project is not a best practice; a future version could build the dataframe from the start, omitting this extra step and allowing a seamless transition between steps.

> Regardless, it still works for this project in its current form.

> Along with having to convert the table into a dataframe, I had to make sure the very first and very last rows contained nothing but zeroes. I found this to be a solution for "properly" showing the data in the visualization below, as otherwise the area glyphs created by Bokeh would not fill in properly.

> Another issue I ran into was that the data visualization library would not read the column name "De Hory" with a space, or "De_Hory" with an underscore. I had to change all instances of this artist's name to "DeHory" in order to please the data visualization deities.

In [9]:
# Creating a dataframe that has all of the frequencies collected for each artist using the function defined earlier.
# Only the frequencies are needed, not the words, so only the second column (index 1) is selected.

data = {
        'Meegeren': make_word_freq_table(Meegeren).column(1), 
        'DeHory': make_word_freq_table(DeHory).column(1),
        'Koons': make_word_freq_table(Koons).column(1),
        'Prince': make_word_freq_table(Prince).column(1),
        'Lawler': make_word_freq_table(Lawler).column(1),
        'Sturtevant': make_word_freq_table(Sturtevant).column(1)
    }

# Refining the dataframe so that it can be parsed by the data visualization tool.
# The first and last rows must be all zeroes so that the glyphs render properly in the next step.

df = pd.DataFrame(data=data)
df = pd.concat([df, df.tail(1)], axis=0)  # append a copy of the last row (its index label, 17, is repeated)
df.iloc[-1] = 0.0                         # zero out the appended row
df.iloc[0] = 0.0                          # zero out the first row; note that this overwrites the "forger" counts
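
> The two `iloc` assignments above leave the last row as an added row of zeroes but replace the first row, so the first term's counts are no longer in `df`. As a hedged alternative, the sketch below closes a shape at the baseline without discarding any values: it repeats the first and last x positions with a height of zero. The function name `close_at_baseline` is hypothetical, and the four sample values are the first four Meegeren frequencies from the table in Step 2.

```python
# Hypothetical helper: trace the values, then drop to zero at both ends
# so a filled shape starts and finishes on the baseline.
def close_at_baseline(values):
    n = len(values)
    if n == 0:
        return [], []
    # Repeat the first and last x positions for the two baseline points.
    xs = [0] + list(range(n)) + [n - 1]
    ys = [0.0] + [float(v) for v in values] + [0.0]
    return xs, ys

# forger, forgery, duplicate, copy (Meegeren counts from Step 2)
xs, ys = close_at_baseline([35, 51, 0, 4])
print(xs)
print(ys)
```

> The same padding could be applied to each artist's column before it is scaled and plotted, so that tick positions 0 through 17 still line up with all 18 terms.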

In [10]:
# Example of what the dataframe looks like in table form.
# While the words are omitted here, they will be added in the visualization in the next step.
# Each row number corresponds with a word in copy_terms (0 to forger, 1 to forgery, 2 to duplicate, etc.).
# Row 0 reads zero for every artist because it was zeroed out in the previous cell.

df

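| | Meegeren | DeHory | Koons | Prince | Lawler | Sturtevant |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 51.0 | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 4.0 | 1.0 | 3.0 | 4.0 | 0.0 | 2.0 |
| 4 | 9.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 21.0 | 14.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 25.0 | 7.0 | 5.0 |
| 7 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
| 8 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 |
| 9 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |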


### Step 4: Data Visualization Code

---

> This is the code block in which I use the Bokeh data visualization library to visualize the data I've created. As mentioned before, there is a wide variety of libraries that I could have used for this step. I used Bokeh because I am familiar with it and I like its interactivity features.

> The code to render the graph itself was mostly taken from the documentation on the Bokeh website for ridge plots. The code had to be altered to suit my data, not only in the relevant inputs but in its structure as well.

> At this step I ran into several issues getting the data to successfully "visualize", or rather, to be visualized in a way that a human viewer could interpret. As mentioned before, I had made changes such as converting my data into a dataframe, changing names to suit the program (De Hory to DeHory), and adding rows of zeroes so the glyphs would render correctly. Another issue was that when I defined the x_range with the original copy_terms, the graph rendered with the data offset from the x axis. To bypass this, I manually defined the x-axis ticks and overrode the numbers with the original terms. This change was necessary, but it was more laborious than relying on the library functions that are meant to reduce such redundancy.
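
> As an aside before the plotting cell: the `ridge` helper pairs every height with its category name, and Bokeh reads each `(category, offset)` pair as a position measured from that category's place on the y axis. The sketch below reproduces the pairing with plain Python lists so the output can be inspected without Bokeh or pandas. `ridge_points` is a hypothetical stand-in for the notebook's `ridge`, and the sample heights are the first four Koons values from the dataframe in Step 3.

```python
# Hypothetical list-based stand-in for the notebook's `ridge` helper.
def ridge_points(category, heights, scale=0.075):
    # Pair each scaled height with the category it belongs to.
    return [(category, scale * height) for height in heights]

points = ridge_points("Koons", [0.0, 0.0, 1.0, 3.0])
for category, offset in points:
    print(category, round(offset, 3))
```

> The `scale` argument controls how tall each ridge is relative to the spacing between artists, so larger values make neighboring ridges overlap more.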

In [11]:
# Enter data into the visualization library's preferred data structure.
# This visualization library can be used without this step, but it can be difficult to render certain visualizations
# without using the library's data structure. 

source = ColumnDataSource(data=dict(x=df.index.values))

# Function used to create ridge glyphs, adapted from the visualization library's documentation

def ridge(category, data, scale=0.075):
    return list(zip([category] * len(data), scale * data))

# Categories, each artist as a category

cats = ['Meegeren', 'DeHory', 'Koons', 'Prince', 'Lawler', 'Sturtevant']

# Defining the parameters of the graph itself

p = figure(
           y_range=cats, 
           plot_width=1000,
           toolbar_location=None, 
           title="Frequency of Terms In Reference to Artists"
           )

# Creating the ridge glyphs for each category and each frequency associated with it.

for i, cat in enumerate(cats):
    y = ridge(cat, df[cat])
    source.add(y, cat)
    p.patch('x', 
            cat, 
            color=Spectral9[i], 
            alpha=0.6, 
            line_color="black", 
            source=source
           )

# Visual options, such as making the font larger, changing the orientation of the words on the x axis,
# changing the color of the lines within the graph, and adding padding to the y axis

p.title.text_font_size = "15pt"
p.xaxis.major_label_text_font_size = "10pt"
p.yaxis.major_label_text_font_size = "10pt"

p.xaxis.major_label_orientation = 1

p.outline_line_color = None
p.background_fill_color = "#efefef"

p.ygrid.grid_line_color = "#dddddd"
p.xgrid.grid_line_color = "#dddddd"

# As mentioned earlier, manual override of the x_range and x axis label

p.xaxis.ticker = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
p.xaxis.major_label_overrides = {0: "forger", 
                                 1: "forgery", 
                                 2: "duplicate",
                                 3: "copy",
                                 4: "fraud",
                                 5: "fake",
                                 6: "appropriation",
                                 7: "appropriate",
                                 8: "replica",
                                 9: "replication",
                                 10: "reproduction",
                                 11: "facsimile",
                                 12: "imitation",
                                 13: "counterfeit",
                                 14: "knockoff",
                                 15: "dupe",
                                 16: "mimic",
                                 17: "plagiarism"}

p.y_range.range_padding = 0.12
p.xaxis.axis_line_width = 0

show(p)
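
> The tick positions and label overrides in the cell above are typed out by hand. As a minimal sketch of one way to reduce that redundancy, both can be derived from `copy_terms`, which keeps the labels in step with the term list. The names `tick_positions` and `label_overrides` are assumptions for illustration; `copy_terms` is repeated here only so the block runs on its own.

```python
copy_terms = ("forger", "forgery", "duplicate", "copy", "fraud", "fake",
              "appropriation", "appropriate", "replica", "replication",
              "reproduction", "facsimile", "imitation", "counterfeit",
              "knockoff", "dupe", "mimic", "plagiarism")

# One tick per term, at positions 0 through 17.
tick_positions = list(range(len(copy_terms)))

# Map each tick position to the term it stands for.
label_overrides = dict(enumerate(copy_terms))

print(tick_positions)
print(label_overrides)
```

> These could then be assigned with `p.xaxis.ticker = tick_positions` and `p.xaxis.major_label_overrides = label_overrides`.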

In [124]:
html = file_html(p, CDN, "my plot")

In [126]:
print(html)


<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>my plot</title>
        
<link rel="stylesheet" href="https://cdn.pydata.org/bokeh/release/bokeh-0.12.16.min.css" type="text/css" />
        
<script type="text/javascript" src="https://cdn.pydata.org/bokeh/release/bokeh-0.12.16.min.js"></script>
<script type="text/javascript">
    Bokeh.set_log_level("info");
</script>
    </head>
    <body>
        
        <div class="bk-root">
            <div class="bk-plotdiv" id="b03c58be-13cc-4d66-a254-50edacabc80c"></div>
        </div>
        
        <script type="application/json" id="56d9af2e-f1be-4bf7-a7db-60a7b6df0d7f">
          {"1ef4b222-b9eb-4f06-8e97-e68faed0d702":{"roots":{"references":[{"attributes":{"fill_alpha":0.6,"fill_color":"#3288bd","line_alpha":0.6,"x":{"field":"x"},"y":{"field":"Meegeren"}},"id":"99405b78-a15e-474d-b3b8-d317ffd0280e","type":"Patch"},{"attributes":{"fill_alpha":0.6,"fill_color":"#e6f598","line_alpha":0.6,"x":{"field