# Checking and retrieving character indexes from quotations


What you will need to run this notebook:

+ The Project Gutenberg fulltext of your source text (text A). In this case, the Project Gutenberg version of *Middlemarch*: `middlemarch.txt`
+ The JSON file with the output of `text-matcher`. In this case, this is `default.json`

Both of these files must be at the locations given by the filepaths in the cells below (adjust the paths if your files are stored elsewhere).


In addition, you will need a list of the JSTOR article ids for the sample texts in the corpus.


### A preliminary note about character indexes:

A match in `text-matcher` takes the form of a pair, or a list of pairs, of character indexes. These character indexes store the position of a match and can be used to retrieve the corresponding text.

Let's say you were looking at the output `[[173657, 173756], [292143, 292406]]`.

In each pair, the first number corresponds to the **starting character index**, and the second number corresponds to the **ending character index** of a quotation.

So in this example, for the match `[173657, 173756]`:
+ the **starting character index** is 173657
+ the **ending character index** is 173756
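As a minimal illustration of how such pairs map back to text, the sketch below slices a short sample string the same way the notebook later slices `mm.text`. The sample string and the index values are invented for this example; with the real data, the string would be the full text of the novel or of an article.

```python
# A toy stand-in for a full text (in the notebook this would be mm.text)
sample_text = "Miss Brooke had that kind of beauty which seems to be thrown into relief by poor dress."

# A made-up output in the same shape as text-matcher's: a list of [start, end] pairs
matches = [[0, 11], [29, 35]]

for start, end in matches:
    # A Python slice includes the start index and excludes the end index
    print([start, end], repr(sample_text[start:end]))
```

Because the notebook retrieves text with `text[start:end]`, the length of a retrieved quotation is always `end - start`.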

### Import libraries
Run the cell below to import libraries

In [1]:
from text_matcher.matcher import Text, Matcher
import json
import pandas as pd
from IPython.display import clear_output
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16, 6]
#pd.set_option('display.max_colwidth', None)

### Load in our data files:

In [2]:
# Load Middlemarch .txt file
# (Note: the path below must point to your copy of 'middlemarch.txt')
with open('../middlemarch.txt') as f: 
    rawMM = f.read()

mm = Text(rawMM, 'Middlemarch')

# Load in the JSON file with our JSTOR articles and data from TextMatcher
# (Note: the path below must point to your copy of 'default.json')
df = pd.read_json('../../default.json')

In [3]:
# Let's peek inside our DataFrame
df.head(3)

Unnamed: 0,creator,datePublished,docSubType,docType,fullText,id,identifier,isPartOf,issueNumber,language,...,title,url,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,abstract,keyphrase,subTitle
0,[Rainer Emig],2006-01-01,book-review,article,"[Monika Mueller, George Eliot U.S.: Transat- l...",http://www.jstor.org/stable/41158244,"[{'name': 'issn', 'value': '03402827'}, {'name...",Amerikastudien / American Studies,3,[eng],...,Review Article,http://www.jstor.org/stable/41158244,51,1109,0,[],[],,,
1,[Martin Green],1970-01-01,book-review,article,[Reviews I57 Thackeray's Critics: An Annotated...,http://www.jstor.org/stable/3722819,"[{'name': 'issn', 'value': '00267937'}, {'name...",The Modern Language Review,1,[eng],...,Review Article,http://www.jstor.org/stable/3722819,65,1342,0,[],[],,,
2,[Richard Exner],1982-01-01,book-review,article,[Essays Mary McCarthy. Ideas and the Novel. Ne...,http://www.jstor.org/stable/40137021,"[{'name': 'issn', 'value': '01963570'}, {'name...",World Literature Today,1,[eng],...,Review Article,http://www.jstor.org/stable/40137021,56,493,0,[],[],,,


# Check quotation matches for particular articles


## Set the `article_id` ‼️

In the cell below, change the variable `article_id` to the id of the article you wish to examine.

**Where can I find the article id?**

+ This can be found in the `id` column, which holds the JSTOR URL of a given article.
+ For *Middlemarch*, please use the following article IDs: 
http://www.jstor.org/stable/41059781,
http://www.jstor.org/stable/2928567,
http://www.jstor.org/stable/25088885,
http://www.jstor.org/stable/462077,
http://www.jstor.org/stable/42827730,
http://www.jstor.org/stable/2933477,
http://www.jstor.org/stable/2873079,
http://www.jstor.org/stable/2932968,
http://www.jstor.org/stable/42827900,
http://www.jstor.org/stable/10.1525/ncl.2001.56.2.160,
http://www.jstor.org/stable/437748,
http://www.jstor.org/stable/27919123,
http://www.jstor.org/stable/2872038,
http://www.jstor.org/stable/3044620,
http://www.jstor.org/stable/591341,
http://www.jstor.org/stable/4334358,
http://www.jstor.org/stable/2933096,
http://www.jstor.org/stable/23539270,
http://www.jstor.org/stable/3751142,
http://www.jstor.org/stable/3825796,
http://www.jstor.org/stable/3826242,
http://www.jstor.org/stable/2932697,
http://www.jstor.org/stable/40754482,
http://www.jstor.org/stable/10.1525/ncl.2012.66.4.494,
http://www.jstor.org/stable/3828324,
http://www.jstor.org/stable/23099626,
http://www.jstor.org/stable/42965156,
http://www.jstor.org/stable/j.ctt155j8bf.9,
http://www.jstor.org/stable/3044863,
http://www.jstor.org/stable/2873139,
http://www.jstor.org/stable/3044571,
http://www.jstor.org/stable/29533514,
http://www.jstor.org/stable/42827934,
http://www.jstor.org/stable/43028240,
http://www.jstor.org/stable/30030019,
http://www.jstor.org/stable/40549795,
http://www.jstor.org/stable/25733489,
http://www.jstor.org/stable/1345484,
http://www.jstor.org/stable/27708593,
http://www.jstor.org/stable/27708062,
http://www.jstor.org/stable/3044589,
http://www.jstor.org/stable/42827827,
http://www.jstor.org/stable/25459494,
http://www.jstor.org/stable/439034


*Note: JSTOR outputs the full text of each article as a list of strings, so we have to concatenate them using text-matcher's `Text()` function.*


In [250]:
# ‼️ 🛑 Make sure to change the variable below to the correct article id 🛑  ‼️
article_id  = 'http://www.jstor.org/stable/3827821' # CHANGE THIS to article id

# Use article_id to get the index of the article in our DataFrame
article_index = df[df['id'] == article_id].index[0]
article_text = df['fullText'].loc[article_index]
article_title = df['title'].loc[article_index]

# Assign the full text of this article to a variable called `cleaned_article_text`, with text-matcher's Text function
cleaned_article_text = Text(article_text, article_title)

# Print out the title and ID of the article we selected as confirmation
print(f"""
Article selected:
ID: {article_id}
Title: {article_title}
""")



Article selected:
ID: http://www.jstor.org/stable/3827821
Title: "Cranford" and the Victorian Collection



## Part 1: Get quotes (& their character indexes) from `text-matcher` output


### What are the index positions of matches in our source text (Text "A")?
Retrieve the character indexes for the source text (Text A):

In [251]:
# What are the locations in A?
print("Middlemarch character indexes:")
df.loc[df['id'] == article_id, 'Locations in A'].item()

Middlemarch character indexes:


[]

### What's the text of one of those matches?

Let's check the corresponding text in Middlemarch for one of the matches output above.  
Change the start and end character indexes to one of the index ranges in the cell above. 

In [252]:
#‼️ 🛑 IMPORTANT: Change the start and end character indexes to one of the outputs above

# Example pair: [40542, 40576]

mm_start = 21520  # 🛑 REPLACE the number with one of the starting character indexes
mm_end = 22096 # 🛑 REPLACE the number with one of the ending character indexes

# Output the text in "A" for the start and end characters selected above
print("Middlemarch character indexes:", f"[{mm_start}, {mm_end}]")
mm.text[mm_start:mm_end]

Middlemarch character indexes: [21520, 22096]


'tely bending over her tapestry, until she heard her\nsister calling her.\n\n"Here, Kitty, come and look at my plan; I shall think I am a great\narchitect, if I have not got incompatible stairs and fireplaces."\n\nAs Celia bent over the paper, Dorothea put her cheek against her\nsister\'s arm caressingly.  Celia understood the action.  Dorothea saw\nthat she had been in the wrong, and Celia pardoned her.  Since they\ncould remember, there had been a mixture of criticism and awe in the\nattitude of Celia\'s mind towards her elder sister.  The younger had\nalways worn a yoke; but is th'
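When checking a match by hand, it can help to see a little of the text on either side of it. The helper below is a hypothetical convenience function, not part of `text-matcher`: it returns the matched span wrapped in double square brackets, with up to `pad` characters of context on each side. The sample sentence is taken from the output above.

```python
def with_context(text, start, end, pad=40):
    """Return text[start:end] wrapped in [[ ]] with up to `pad` characters of context per side."""
    left = text[max(0, start - pad):start]   # never reach before the start of the text
    right = text[end:end + pad]              # slicing past the end is safe in Python
    return f"{left}[[{text[start:end]}]]{right}"

sample = "Celia understood the action.  Dorothea saw that she had been in the wrong, and Celia pardoned her."
start = sample.find("Dorothea")
end = start + len("Dorothea")
print(with_context(sample, start, end, pad=15))
```

In the notebook, the same call would take `mm.text`, `mm_start` and `mm_end` as arguments.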

### What are the index positions of matches in our target text (Text "B")?
Retrieve the character indexes in the B text (that is, the article):

In [253]:
# What are the locations in B?
print(f"Character index locations for {article_id}:")
df.loc[df['id'] == article_id, 'Locations in B'].item()

Character index locations for http://www.jstor.org/stable/3827821:


[]

In [254]:
#‼️ 🛑 IMPORTANT: Change the start and end character indexes to one of the outputs above

textB_start = 10882   # 🛑 REPLACE the number to the left with one of the starting character indexes
textB_end =  10940 # 🛑 REPLACE the number to the left with one of the ending character indexes

# Output the text in "B" for the start and end characters selected above 
print(f"Character index locations for {article_id}:", f"[{textB_start}, {textB_end}]")
cleaned_article_text.text[textB_start:textB_end]

Character index locations for http://www.jstor.org/stable/3827821: [10882, 10940]


'"at the same time bewildering and dull" (69) could equally'
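Once you have the text for a match in both A and B, a rough similarity score can show how closely the two spans agree. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; this is an illustrative check only and is not necessarily how `text-matcher` itself compares texts. The function name `similarity` and the sample strings are made up for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a ratio between 0.0 and 1.0 after normalizing case and whitespace."""
    def normalize(s):
        # Collapse line breaks and repeated spaces, and ignore capitalization
        return " ".join(s.lower().split())
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

text_a = "The younger had\nalways worn a yoke"   # line break as in the Gutenberg text
text_b = "the younger had always worn a yoke"     # as it might appear in an article
print(similarity(text_a, text_b))
```

In the notebook, the two arguments would be `mm.text[mm_start:mm_end]` and `cleaned_article_text.text[textB_start:textB_end]`.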

### What's the text of one of those matches in Text "B" (the article)?
Change the start and end character indexes to one of the index ranges in the cell above.

---

## Find the index positions of a given quotation

To establish all of the "ground truth" quotations (and their character indexes), we'll want to get the character indexes not just for quotations that text-matcher successfully matched, but for *all* quotations in that article.

To retrieve the character indexes for all quotations in an article that are legible to human eyes, follow the steps below.


### Step 1: Locate the quotation in the PDF of the article.

### Step 2: Locate the text of that quotation as it appears in the JSON file in the "fullText" field
(🛑 Make sure you've entered the `article_id` for the article in the section "Set the `article_id`", first!!)  
Run the cell below, and then use "CTRL+F" in your browser to find the quotation as it appears in the article text.

In [None]:
print(cleaned_article_text.text)

### Step 3: Copy that text of the quotation as it appears exactly in the article text above.

### Step 4: Paste the text of the quotation in the `quotation` field below
Make sure that you enclose the quotation in quotation marks.

If there are quotation marks in the text of the quote, either place an escape character `\` in front of them, or change the quotation marks that you use. (E.g., if there are single quotes (`'`) in the text, use double quotes (`"`) to surround the text.)

Run the cell below.

In [None]:
# PASTE the quotation below in the field, replacing the text below ‼️
# Make sure to include quotation marks around the string
quotation = 'over that tempting range of relevancies called the universe'
index = cleaned_article_text.text.rindex(quotation)
print(f"Article id: {article_id}")
print('Starting index:', index) 
print('Ending index:', index + len(quotation))
print(f'\nQuotation and character indexes to paste into spreadsheet:\n\n{quotation} \t[{index}, {index + len(quotation)}]')
print("\nSanity check text:")
cleaned_article_text.text[index:index + len(quotation)]
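Note that `str.rindex` returns only the *last* occurrence of the quotation, and raises a `ValueError` if the quotation is not found. When a quotation appears more than once in an article, a loop over `str.find` can list every occurrence. The function below is a small sketch under that assumption; `find_all` is a hypothetical helper name and the sample text is invented.

```python
def find_all(text, quotation):
    """Return a [start, end] pair for every occurrence of quotation in text."""
    if not quotation:
        return []
    pairs = []
    start = text.find(quotation)
    while start != -1:                      # str.find returns -1 when nothing more is found
        pairs.append([start, start + len(quotation)])
        start = text.find(quotation, start + 1)
    return pairs

sample = "virtually unknown, virtually unknown"
print(find_all(sample, "virtually unknown"))
```

In the notebook, `find_all(cleaned_article_text.text, quotation)` would return all candidate index pairs, from which you can pick the one that corresponds to the page you are reading.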



### Step 5: Record the character indexes and article id in spreadsheet
Add the character indexes and article ID as a new row in a spreadsheet

---

### Reading data in for sanity checks
Sanity check: make sure that the quotation pasted by a student appears in JSONL

In [5]:
student_df = pd.read_csv('../data/middlemarch-ground-truth-indexes-updated.csv', encoding = 'utf-8')
student_df.head()

Unnamed: 0,id,Author name,Year,Page number,Quotation from PDF,Quotation from JSONL full text,Ground-truth character indexes,Notes
0,Book chapter,“Notes”,2006,p. 533,““The Doctor was more than suspected of having...,,,
1,http://www.jstor.org/stable/30030019,Albrecht,2006,p. 443,"“All of us, grave or light, get our thoughts e...","All of us, grave or light, get our thoughts en...","[14718, 14825]","addition of ""of them"""
2,http://www.jstor.org/stable/30030019,Albrecht,2006,p. 446,"“""known merely as a cluster of signs for \[her...",known merely as a cluster of signs for his nei...,"[64756, 64829]","""\[her\]"" replaced with ""his"", deletion of ""1"""
3,http://www.jstor.org/stable/30030019,Albrecht,2006,p. 450,"“""Rosamond, in fact, was entirely occupied not...","Rosamond, in fact, was entirely occupied not e...","[36183, 36307]","deletion of ""23"""
4,http://www.jstor.org/stable/30030019,Albrecht,2006,p. 452,"“""virtually unknown""”",virtually unknown,"[64738, 64755]",three instances of this quotation


In [7]:
for item in student_df['Ground-truth character indexes']:
    print(type(item))

<class 'float'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
...

In [8]:
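def string_to_list(my_string):
    # Convert a string such as '[14718, 14825]' into a list of ints.
    # Non-string values (e.g. NaN for empty cells) return None.
    numeric_list = []
    try:
        for x in my_string.split(','):
            # Strip brackets and whitespace from each piece before converting
            x = x.strip('[] \n')
            numeric_list.append(int(x))
        return numeric_list
    except AttributeError:
        return None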

In [9]:
# Create a new list called "cleaned_ground_indexes"
cleaned_ground_indexes = []

for item in student_df['Ground-truth character indexes']:
    cleaned_ground_indexes.append(string_to_list(item))
cleaned_ground_indexes

[None,
 [14718, 14825],
 [64756, 64829],
 [36183, 36307],
 [64738, 64755],
 [49114, 49181],
 [64549, 64829],
 [74932, 75029],
 [3228, 3249],
 [3834, 3846],
 [3948, 3985],
 [4020, 4031],
 [4047, 4071],
 [4233, 4352],
 [4976, 4991],
 [5129, 5148],
 [9232, 9254],
 [9290, 9656],
 [10546, 10627],
 [12614, 12698],
 [12719, 12748],
 [12896, 12900],
 [13333, 13345],
 [13431, 13644],
 [13835, 14026],
 [14170, 14203],
 [14217, 14247],
 [14366, 14451],
 [14503, 14511],
 [14938, 15142],
 [15185, 15191],
 [15201, 15281],
 [15655, 15670],
 [15937, 16000],
 [16563, 16714],
 [16867, 16901],
 [16931, 16965],
 [16982, 17067],
 [17146, 17267],
 [17428, 17447],
 [18147, 18165],
 [18234, 18264],
 [18723, 18748],
 [18820, 18887],
 [18985, 19006],
 [19038, 19064],
 [19438, 19618],
 [20185, 20305],
 [20523, 20557],
 [20563, 20568],
 [21147, 21397],
 [21447, 21484],
 [21642, 21727],
 [22025, 22031],
 [22063, 22073],
 [22216, 22228],
 [22255, 22414],
 [22860, 22865],
 [23145, 23261],
 [23282, 23390],
 [23842, 2
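Since the index strings in the spreadsheet are written like JSON arrays (for example `"[14718, 14825]"`), the standard-library `json` module offers an alternative way to parse them. This is a sketch of that alternative, not the notebook's method; `parse_indexes` is a hypothetical name. Unlike `string_to_list`, it keeps nested lists nested and returns `None` for malformed cells instead of raising an error.

```python
import json

def parse_indexes(cell):
    """Parse a spreadsheet cell such as '[14718, 14825]' into a Python list, or return None."""
    # Empty spreadsheet cells arrive as float NaN rather than as strings
    if not isinstance(cell, str) or not cell.strip():
        return None
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, list) else None

print(parse_indexes("[14718, 14825]"))
print(parse_indexes(float("nan")))
```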

In [10]:
student_df['Ground-truth character indexes'] = cleaned_ground_indexes

In [11]:
for item in student_df['Ground-truth character indexes']:
    print(type(item))

<class 'NoneType'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
<class 'list'>
...

## Sanity Check 1: Does the quotation match the JSONL?
Sanity check: make sure that the quotation pasted by a student appears in the JSONL full text.

### What's the text of each of those matches?

The cell below loops over every row of the spreadsheet and prints the text found in the JSONL full text at the recorded character indexes, so that it can be compared with the quotation the student pasted.

In [None]:
for row in range(len(student_df)):
    indexes = student_df['Ground-truth character indexes'].loc[row]
    # Skip rows without recorded indexes (None after cleaning)
    if not isinstance(indexes, list):
        continue
    quote = student_df['Quotation from JSONL full text'].loc[row]
    start, end = indexes[0], indexes[1]
    article_id_new = student_df['id'].loc[row]
    print(f'ID:{article_id_new}')
    print(f'start:{start}')
    print(f'end:{end}')
    print(f'Quotation from JSONL full text: {quote}')
    article_index = df[df['id'] == article_id_new].index[0]
    article_text = df['fullText'].loc[article_index]
    article_title = df['title'].loc[article_index]
    # Assign the full text of this article to a variable called `cleaned_article_text`, with text-matcher's Text function
    cleaned_article_text = Text(article_text, article_title)
    print(f'Sanity check:{cleaned_article_text.text[start:end]}\n\n')
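Reading through every printed row by eye is slow, so the comparison itself can also be automated. The function below is a minimal sketch: `check_row` is a hypothetical helper that reports whether the text at the recorded indexes is exactly the recorded quotation. The article text, quotation, and indexes used here are toy values.

```python
def check_row(article_text, quotation, indexes):
    """Return True/False for a checkable row, or None when the row has no usable indexes."""
    if not isinstance(indexes, list) or len(indexes) != 2:
        return None
    start, end = indexes
    return article_text[start:end] == quotation

toy_article = "Eliot calls the town virtually unknown to outsiders."
print(check_row(toy_article, "virtually unknown", [21, 38]))   # indexes point at the quotation
print(check_row(toy_article, "virtually unknown", [20, 38]))   # off by one at the start
print(check_row(toy_article, "virtually unknown", None))       # nothing recorded
```

Rows for which the function returns `False` are the ones worth inspecting manually.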

### Sanity Check 2: Does the quotation appear in Middlemarch?
Sanity check: for the dataset, check that the JSONL full text matches a string in *Middlemarch*, and print an error message if not.

In [None]:
# PASTE the quotation below in the field, replacing the text below ‼️
# Make sure to include quotation marks around the string
quotation = "coherent social faith and order which could perform the function of knowledge for the ardently willing soul"
index = mm.text.find(quotation)
# str.find returns -1 when the quotation does not appear in the text
if index == -1:
    print("ERROR: quotation not found in Middlemarch")
else:
    print(mm.text[index:index + len(quotation)])
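One reason an exact search can fail is that the Project Gutenberg text wraps lines, so a quotation that runs across a line break contains `\n` where the article has a space (the *Middlemarch* output earlier in this notebook shows these line breaks). The sketch below shows one possible workaround using a regular expression that accepts any whitespace between words; `find_ignoring_line_breaks` is a hypothetical helper name, and the sample text is taken from that earlier output.

```python
import re

def find_ignoring_line_breaks(text, quotation):
    """Return [start, end] of the first match, treating any run of whitespace as equivalent."""
    words = quotation.split()
    if not words:
        return None
    # Escape each word so punctuation is matched literally, then allow any whitespace between words
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, text)
    return [match.start(), match.end()] if match else None

sample = "she heard her\nsister calling her."
print(sample.find("her sister calling"))                       # exact search fails
print(find_ignoring_line_breaks(sample, "her sister calling"))  # whitespace-tolerant search
```

The returned indexes refer to positions in the original, unmodified text, so they can be used directly for slicing.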

### Sanity Check 3
Checking what `text-matcher` misclassified

In [None]:
misclassifications_df = pd.read_csv('../../../Middlematch/hyperparameter-data/text-matcher-misclassifications/text-matcher-misclassifications-default.csv', encoding='utf-8')

In [None]:
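def string_to_list(my_string):
    # Convert a string such as '[286, 2609]' into a list of ints.
    # Non-string values (e.g. NaN for empty cells) return None.
    numeric_list = []
    try:
        for x in my_string.split(','):
            # Strip brackets and whitespace from each piece before converting
            x = x.strip('[] \n')
            numeric_list.append(int(x))
        return numeric_list
    except AttributeError:
        return None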

In [None]:
cleaned_indexes = []

for item in misclassifications_df['Text-matcher character indexes']:
    cleaned_indexes.append(string_to_list(item))
cleaned_indexes

misclassifications_df['cleaned indexes'] = cleaned_indexes

### Print out the article id, indexes, and text in the article that text-matcher identified as a match

In [None]:
# Print out the article id, indexes, and text in the article that text-matcher identified as a match
for row in range(len(misclassifications_df)):
    article_id  = misclassifications_df['id'].iloc[row]
    # Load the text of this row's article so the indexes are applied to the right text
    article_index = df[df['id'] == article_id].index[0]
    cleaned_article_text = Text(df['fullText'].loc[article_index], article_id)
    character_indexes = misclassifications_df['cleaned indexes'].iloc[row]
    textB_start = character_indexes[0]
    textB_end  = character_indexes[1]
    print(f"Character index locations for {article_id}:", f"[{textB_start}, {textB_end}]")
    print(cleaned_article_text.text[textB_start:textB_end])
    print("\n")


## Checking misclassifications

In [None]:
import pandas as pd
from pathlib import Path

### Let's look at ALL misclassifications from all hyperparameter settings

In [None]:
misclassifications = "../../../Middlematch/hyperparameter-data/text-matcher-misclassifications/"

In [None]:
misclassifications_df = pd.concat((pd.read_csv(filename) for filename in Path(misclassifications).glob('*.csv')))

In [None]:
misclassifications_df = misclassifications_df.drop(columns=["Unnamed: 0"])
misclassifications_df

In [None]:
misclassifications_df['Text-matcher character indexes'].value_counts()

In [None]:
misclassifications_df = misclassifications_df.drop_duplicates()
misclassifications_df

In [None]:
cleaned_text_matcher_indexes = []

for item in misclassifications_df['Text-matcher character indexes']:
    cleaned_text_matcher_indexes.append(string_to_list(item))
cleaned_text_matcher_indexes

In [None]:
misclassifications_df['Text-matcher character cleaned indexes'] = cleaned_text_matcher_indexes

In [None]:

for row in range(len(misclassifications_df)):
    article_id  = misclassifications_df['id'].iloc[row]
    article_index = df[df['id'] == article_id].index[0]
    article_text = df['fullText'].loc[article_index]
    cleaned_article_text = Text(article_text, article_id)
    character_indexes = misclassifications_df['Text-matcher character cleaned indexes'].iloc[row]
    textB_start = character_indexes[0]
    textB_end  = character_indexes[1]
    print(f"Character index locations for {article_id}:", f"[{textB_start}, {textB_end}]")
    print(cleaned_article_text.text[textB_start:textB_end])
    print("\n")
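The loop above rebuilds the article text for every row, even when many rows belong to the same article. One possible refinement is to group the index pairs by article id first, so that each article only needs to be processed once. The sketch below illustrates the grouping step with the standard library; the article ids are placeholders and the rows stand in for the DataFrame.

```python
from collections import defaultdict

# Placeholder (article id, [start, end]) rows standing in for misclassifications_df
rows = [
    ("article-1", [286, 2609]),
    ("article-2", [96, 106]),
    ("article-1", [3320, 3331]),
]

def group_by_article(rows):
    """Collect all index pairs for each article id, preserving row order."""
    grouped = defaultdict(list)
    for article_id, indexes in rows:
        grouped[article_id].append(indexes)
    return dict(grouped)

grouped = group_by_article(rows)
for article_id, index_pairs in grouped.items():
    # In the notebook, the article's text would be built here, once per article
    print(article_id, index_pairs)
```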
        

## To check misclassifications from specific hyperparameter settings

In [339]:
name_of_hyperparameter = "t1-c1-n1-m5-with-stops"

In [340]:
misclassifications_df = pd.read_csv(f"../../../Middlematch/hyperparameter-data/text-matcher-misclassifications/text-matcher-misclassifications-{name_of_hyperparameter}.csv")

In [341]:
misclassifications_df = misclassifications_df.drop(columns=["Unnamed: 0"])
misclassifications_df

Unnamed: 0,id,Text-matcher character indexes,Actual,Match-detected
0,http://www.jstor.org/stable/27708593,"[286, 2609]",0,1
1,http://www.jstor.org/stable/27708593,"[3320, 3331]",0,1
2,http://www.jstor.org/stable/27708593,"[3471, 3479]",0,1
3,http://www.jstor.org/stable/27708593,"[3480, 3489]",0,1
4,http://www.jstor.org/stable/27708593,"[4391, 4399]",0,1
...,...,...,...,...
1806,http://www.jstor.org/stable/3825796,"[59162, 59175]",0,1
1807,http://www.jstor.org/stable/3825796,"[59406, 59414]",0,1
1808,http://www.jstor.org/stable/3825796,"[59794, 59814]",0,1
1809,http://www.jstor.org/stable/3825796,"[60315, 60322]",0,1


In [342]:
misclassifications_df['Text-matcher character indexes'].value_counts()

[34424, 34431]    2
[71136, 71145]    1
[34568, 34574]    1
[8989, 8995]      1
[38046, 38071]    1
                 ..
[44569, 44579]    1
[76622, 76633]    1
[32860, 32868]    1
[24126, 24141]    1
[11598, 11605]    1
Name: Text-matcher character indexes, Length: 1810, dtype: int64

In [343]:
misclassifications_df = misclassifications_df.drop_duplicates()
misclassifications_df

Unnamed: 0,id,Text-matcher character indexes,Actual,Match-detected
0,http://www.jstor.org/stable/27708593,"[286, 2609]",0,1
1,http://www.jstor.org/stable/27708593,"[3320, 3331]",0,1
2,http://www.jstor.org/stable/27708593,"[3471, 3479]",0,1
3,http://www.jstor.org/stable/27708593,"[3480, 3489]",0,1
4,http://www.jstor.org/stable/27708593,"[4391, 4399]",0,1
...,...,...,...,...
1806,http://www.jstor.org/stable/3825796,"[59162, 59175]",0,1
1807,http://www.jstor.org/stable/3825796,"[59406, 59414]",0,1
1808,http://www.jstor.org/stable/3825796,"[59794, 59814]",0,1
1809,http://www.jstor.org/stable/3825796,"[60315, 60322]",0,1


In [344]:
cleaned_text_matcher_indexes = []

for item in misclassifications_df['Text-matcher character indexes']:
    cleaned_text_matcher_indexes.append(string_to_list(item))
cleaned_text_matcher_indexes

[[286, 2609],
 [3320, 3331],
 [3471, 3479],
 [3480, 3489],
 [4391, 4399],
 [4400, 4409],
 [4507, 4517],
 [4521, 4534],
 [5926, 5936],
 [7599, 7613],
 [8041, 8050],
 [8443, 8454],
 [8653, 8671],
 [8692, 8714],
 [8724, 8735],
 [8964, 8975],
 [9361, 9376],
 [10146, 10156],
 [10897, 10905],
 [11905, 11913],
 [12136, 12150],
 [12795, 12802],
 [13605, 13613],
 [13998, 14007],
 [14191, 14198],
 [14603, 14619],
 [18766, 18774],
 [19208, 19216],
 [23142, 23148],
 [65120, 65133],
 [65360, 65368],
 [69345, 69352],
 [69911, 69921],
 [70379, 70399],
 [70887, 70898],
 [71582, 71590],
 [71755, 71764],
 [72049, 72060],
 [72117, 72126],
 [72672, 72678],
 [73211, 73219],
 [73287, 73301],
 [73770, 73777],
 [73799, 73808],
 [73931, 73942],
 [74049, 74081],
 [96, 106],
 [2155, 2163],
 [2255, 2262],
 [2609, 2617],
 [6028, 6049],
 [8811, 8817],
 [9515, 9537],
 [9563, 9570],
 [9611, 9621],
 [11198, 11211],
 [11825, 11845],
 [11981, 11991],
 [13253, 13261],
 [13331, 13752],
 [14176, 14184],
 [14367, 14375],
 [

In [345]:
misclassifications_df['Text-matcher character cleaned indexes'] = cleaned_text_matcher_indexes

In [346]:
with open(f"../Middlematch/hyperparameter-data/text-matcher-misclassifications/misclassifications_text_{name_of_hyperparameter}.txt", "a") as f:
    for row in range(len(misclassifications_df)):
        article_id  = misclassifications_df['id'].iloc[row]
        article_index = df[df['id'] == article_id].index[0]
        article_text = df['fullText'].loc[article_index]
        cleaned_article_text = Text(article_text, article_id)
        character_indexes = misclassifications_df['Text-matcher character cleaned indexes'].iloc[row]
        textB_start = character_indexes[0]
        textB_end  = character_indexes[1]
        print(f"Character index locations for {article_id}:", f"[{textB_start}, {textB_end}]", file=f)
        print(cleaned_article_text.text[textB_start:textB_end], file=f)
        print("\n", file=f)
        

## Checking the text of matches from specific journals

In [3]:
# Load in the JSON file with our JSTOR articles and data from TextMatcher
# (Note: the path below must point to your copy of 'default.json')
hyperparameter_df = pd.read_json(f'../../../Middlematch/hyperparameter-data/default.json')
hyperparameter_df

Unnamed: 0,creator,datePublished,docSubType,docType,id,identifier,isPartOf,issueNumber,language,outputFormat,...,title,url,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,abstract,keyphrase,subTitle
0,[Rainer Emig],2006-01-01,book-review,article,http://www.jstor.org/stable/41158244,"[{'name': 'issn', 'value': '03402827'}, {'name...",Amerikastudien / American Studies,3,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/41158244,51,1109,0,[],[],,,
1,[Martin Green],1970-01-01,book-review,article,http://www.jstor.org/stable/3722819,"[{'name': 'issn', 'value': '00267937'}, {'name...",The Modern Language Review,1,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/3722819,65,1342,0,[],[],,,
2,[Richard Exner],1982-01-01,book-review,article,http://www.jstor.org/stable/40137021,"[{'name': 'issn', 'value': '01963570'}, {'name...",World Literature Today,1,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/40137021,56,493,0,[],[],,,
3,[Ruth Evelyn Henderson],1925-10-01,research-article,article,http://www.jstor.org/stable/802346,"[{'name': 'issn', 'value': '00138274'}, {'name...",The English Journal,8,[eng],"[unigram, bigram, trigram, fullText]",...,American Education Week--November 16-22; Some ...,http://www.jstor.org/stable/802346,14,2161,0,[],[],,,
4,[Alan Palmer],2011-12-01,research-article,article,http://www.jstor.org/stable/10.5325/style.45.4...,"[{'name': 'issn', 'value': '00394238'}, {'name...",Style,4,[eng],"[unigram, bigram, trigram]",...,Rejoinder to Response by Marie-Laure Ryan,http://www.jstor.org/stable/10.5325/style.45.4...,45,1127,0,[],[],,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5879,[Michaela Giesenkirchen],2005-10-01,research-article,article,http://www.jstor.org/stable/27747183,"[{'name': 'issn', 'value': '15403084'}, {'name...",American Literary Realism,1,[eng],"[unigram, bigram, trigram]",...,Ethnic Types and Problems of Characterization ...,http://www.jstor.org/stable/27747183,38,7349,1,"[[23799, 24121]]","[[41472, 41793]]",,,
5880,[Leon Botstein],2005-07-01,misc,article,http://www.jstor.org/stable/4123220,"[{'name': 'issn', 'value': '00274631'}, {'name...",The Musical Quarterly,2,[eng],"[unigram, bigram, trigram]",...,On the Power of Music,http://www.jstor.org/stable/4123220,88,1525,0,[],[],,,
5881,[Linda M. Shires],2013-01-01,research-article,article,http://www.jstor.org/stable/24575734,"[{'name': 'issn', 'value': '10601503'}, {'name...",Victorian Literature and Culture,4,[eng],"[unigram, bigram, trigram]",...,"HARDY'S MEMORIAL ART: IMAGE AND TEXT IN ""WESSE...",http://www.jstor.org/stable/24575734,41,10736,1,"[[173657, 173756]]","[[33963, 34061]]",,,
5882,[Edward H. Cohen],1990-07-01,misc,article,http://www.jstor.org/stable/3827815,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,4,[eng],"[unigram, bigram, trigram]",...,Victorian Bibliography for 1989,http://www.jstor.org/stable/3827815,33,81819,0,[],[],,,


In [4]:
# Load in the CSV file with our Victorian Studies articles and data from TextMatcher
vs_df = pd.read_csv(f'../Middlematch/victorian_studies_quotations.csv')

In [5]:
# Let's just look at the data for our Victorian Studies dataset
allowed_values = vs_df['id']
df_filtered = hyperparameter_df[hyperparameter_df['id'].isin(allowed_values)]
victorian_studies_df = df_filtered

In [6]:
victorian_studies_df

Unnamed: 0,creator,datePublished,docSubType,docType,id,identifier,isPartOf,issueNumber,language,outputFormat,...,title,url,volumeNumber,wordCount,numMatches,Locations in A,Locations in B,abstract,keyphrase,subTitle
39,[A. S. Crehan],1976-03-01,research-article,article,http://www.jstor.org/stable/3826133,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,3,[eng],"[unigram, bigram, trigram]",...,Victorian Literature: Materials for Teaching a...,http://www.jstor.org/stable/3826133,19,13133,0,[],[],,,
41,[Ronald E. Freeman],1968-06-01,misc,article,http://www.jstor.org/stable/3825239,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,4,[eng],"[unigram, bigram, trigram]",...,Victorian Bibliography for 1967,http://www.jstor.org/stable/3825239,11,36967,0,[],[],,,
88,[Sally Shuttleworth],2015-10-01,research-article,article,http://www.jstor.org/stable/10.2979/victorians...,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,1,[eng],"[unigram, bigram, trigram]",...,"Childhood, Severed Heads, and the Uncanny: Fre...",http://www.jstor.org/stable/10.2979/victorians...,58,11238,0,[],[],Freud's theories of the uncanny are generally ...,,
394,[Michael Millgate],1966-03-01,book-review,article,http://www.jstor.org/stable/3825715,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,3,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/3825715,9,1462,0,[],[],,,
459,[William Patrick Day],1989-07-01,book-review,article,http://www.jstor.org/stable/3828259,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,4,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/3828259,32,2540,0,[],[],,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5454,[Nancy Aycock Metz],1990-04-01,research-article,article,http://www.jstor.org/stable/3827697,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,3,[eng],"[unigram, bigram, trigram]",...,"""Little Dorrit's"" London: Babylon Revisited",http://www.jstor.org/stable/3827697,33,9559,0,[],[],,,
5545,[Elaine Freedgood],2003-07-01,research-article,article,http://www.jstor.org/stable/3829530,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,4,[eng],"[unigram, bigram, trigram]",...,"""Fine Fingers"": Victorian Handmade Lace and Ut...",http://www.jstor.org/stable/3829530,45,9287,0,[],[],,,
5765,[K. K. Collins],1978-07-01,research-article,article,http://www.jstor.org/stable/3827594,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,4,[eng],"[unigram, bigram, trigram]",...,G. H. Lewes Revised: George Eliot and the Mora...,http://www.jstor.org/stable/3827594,21,14095,1,"[[913852, 914721]]","[[52941, 53809]]",,,
5828,[Sara Murphy],2014-07-01,book-review,article,http://www.jstor.org/stable/10.2979/victorians...,"[{'name': 'issn', 'value': '00425222'}, {'name...",Victorian Studies,4,[eng],"[unigram, bigram, trigram]",...,Review Article,http://www.jstor.org/stable/10.2979/victorians...,56,2194,0,[],[],,,


In [20]:
victorian_studies_df = victorian_studies_df[['id','Locations in A', 'Locations in B']]
victorian_studies_df

Unnamed: 0,id,Locations in A,Locations in B
1645,http://www.jstor.org/stable/3828663,"[302998, 303063]","[3630, 3695]"
1818,http://www.jstor.org/stable/20537403,"[42574, 42644]","[18758, 18828]"
1889,http://www.jstor.org/stable/3825506,"[560626, 560784]","[3047, 3206]"
3419,http://www.jstor.org/stable/10.2979/victorians...,"[25917, 26218]","[17052, 17375]"
3419,http://www.jstor.org/stable/10.2979/victorians...,"[25962, 26223]","[17122, 17380]"
3419,http://www.jstor.org/stable/10.2979/victorians...,"[77537, 77617]","[18322, 18397]"
3419,http://www.jstor.org/stable/10.2979/victorians...,"[291679, 291940]","[24621, 24882]"
3419,http://www.jstor.org/stable/10.2979/victorians...,"[349604, 350154]","[28459, 29008]"
3419,http://www.jstor.org/stable/10.2979/victorians...,"[421728, 421994]","[32194, 32459]"
3419,http://www.jstor.org/stable/10.2979/victorians...,"[1664676, 1664798]","[33410, 33532]"


In [16]:
victorian_studies_df = victorian_studies_df[victorian_studies_df['Locations in B'].notna()]

In [None]:
victorian_studies_df = victorian_studies_df.apply(pd.Series.explode)
victorian_studies_df

In [18]:
ids = []
locations_in_A_VS_chap15 = []
locations_in_B_VS_chap15 = []

    
for i, row in victorian_studies_df.iterrows(): 
    starts = row['Locations in A'] 
    locations = row['Locations in B'] 
    if starts[0] > 290371 and starts[0] < 322052: # Does it cite Chapter XV? 
        locations_in_A_VS_chap15.append(row['Locations in A'])
        locations_in_B_VS_chap15.append(locations)
        ids.append(row.id)
victorian_studies_chap15_df = pd.DataFrame(list(zip(ids, locations_in_A_VS_chap15, locations_in_B_VS_chap15)), columns = ['id', 'Locations in A', 'Locations in B'])

In [19]:
victorian_studies_chap15_df

Unnamed: 0,id,Locations in A,Locations in B
0,http://www.jstor.org/stable/3828663,"[302998, 303063]","[3630, 3695]"
1,http://www.jstor.org/stable/10.2979/victorians...,"[291679, 291940]","[24621, 24882]"
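The chapter filter above can be expressed as a small reusable function. The sketch below uses the same Chapter XV boundaries as the cell above (290371 and 322052) and two rows taken from the table of Victorian Studies matches; `cites_range` is a hypothetical helper name, and the plain list of tuples stands in for the DataFrame.

```python
CHAPTER_15_START = 290371  # boundaries used in the filtering cell above
CHAPTER_15_END = 322052

def cites_range(location, range_start, range_end):
    """Return True when the match's starting index lies strictly inside the range."""
    start = location[0]
    return range_start < start < range_end

# (article id, location in A) pairs copied from the output above
rows = [
    ("http://www.jstor.org/stable/3828663", [302998, 303063]),
    ("http://www.jstor.org/stable/20537403", [42574, 42644]),
]

chapter_15_ids = [
    article_id
    for article_id, location in rows
    if cites_range(location, CHAPTER_15_START, CHAPTER_15_END)
]
print(chapter_15_ids)
```

Passing the boundaries as arguments means the same function can be reused for any other chapter once its starting and ending character indexes are known.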


In [21]:
with open(f"../../../Middlematch/victorian_studies_chapter_15_default.txt", "a") as f:
    for row in range(len(victorian_studies_chap15_df)):
        article_id  = victorian_studies_chap15_df['id'].iloc[row]
        article_index = df[df['id'] == article_id].index[0]
        article_text = df['fullText'].loc[article_index]
        cleaned_article_text = Text(article_text, article_id)
        character_indexes = victorian_studies_chap15_df['Locations in B'].iloc[row]
        textB_start = character_indexes[0]
        textB_end  = character_indexes[1]
        print(f"Character index locations for {article_id}:", f"[{textB_start}, {textB_end}]", file=f)
        print(cleaned_article_text.text[textB_start:textB_end], file=f)
        print("\n", file=f)
   