
# Introduction to Text Data

Today's reading showed how to compute quantities like TF-IDF and cosine distance both "from scratch" and with scikit-learn. We will generally rely on scikit-learn, so that is how you should solve these problems. However, you may want to do a few parts from scratch to understand the process more concretely.

In [None]:
import pandas as pd

## The Gospels

The Christian Bible is a collection of books. Four of these books (Matthew, Mark, Luke, John) tell the life of Jesus; these 4 books are known as the "Gospels".

The text of the four books is stored in four files:
- Matthew: http://dlsun.github.io/stats112/data/gospels/matthew.txt
- Mark: http://dlsun.github.io/stats112/data/gospels/mark.txt
- Luke: http://dlsun.github.io/stats112/data/gospels/luke.txt
- John: http://dlsun.github.io/stats112/data/gospels/john.txt

The following reads the four texts into a list called `corpus`.

In [None]:
dir = "http://dlsun.github.io/stats112/data/gospels/"
gospel_files = ["matthew.txt", "mark.txt", "luke.txt", "john.txt"]

In [None]:
import requests

corpus = []
for text in gospel_files:
  response = requests.get(dir + text)
  corpus.append(response.text)

1\. Construct the term-frequency matrix for this corpus, and calculate the Euclidean distances between all pairs of gospels. Based on this measure, which two gospels are most similar? Most different?

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer()

tf_matrix = vectorizer.fit_transform(corpus).toarray()

gospel_names = ["Matthew", "Mark", "Luke", "John"]
tf_df = pd.DataFrame(tf_matrix, index=gospel_names, columns=vectorizer.get_feature_names_out())

tf_df.head()


Unnamed: 0,aaron,abased,abba,abel,abia,abiathar,abide,abideth,abiding,abilene,...,youth,zabulon,zacchaeus,zacharias,zara,zeal,zebedee,zebedees,zelotes,zorobabel
Matthew,0,1,0,1,2,0,1,0,0,0,...,1,2,0,1,1,0,4,2,0,2
Mark,0,0,1,0,0,1,1,0,0,0,...,1,0,0,0,0,0,4,0,0,0
Luke,1,2,0,1,1,0,3,0,1,1,...,1,0,3,10,0,0,1,0,1,1
John,0,0,0,0,0,0,10,6,1,0,...,0,0,0,0,0,1,1,0,0,0


In [None]:
from scipy.spatial import distance

euclidean_distances = pd.DataFrame(index=gospel_names, columns=gospel_names)

for gospel1 in gospel_names:
    for gospel2 in gospel_names:
        euclidean_distances.at[gospel1, gospel2] = distance.euclidean(tf_df.loc[gospel1], tf_df.loc[gospel2])

# Row and column positions of the entries above the diagonal, so each pair is considered once.
rows, cols = np.triu_indices(len(gospel_names), k=1)
pair_distances = euclidean_distances.values[rows, cols].astype(float)

# Map the position of the smallest and largest pairwise distance back to a (row, column) pair.
most_similar_pair = (rows[np.argmin(pair_distances)], cols[np.argmin(pair_distances)])
most_different_pair = (rows[np.argmax(pair_distances)], cols[np.argmax(pair_distances)])

euclidean_distances, (gospel_names[most_similar_pair[0]], gospel_names[most_similar_pair[1]]), (gospel_names[most_different_pair[0]], gospel_names[most_different_pair[1]])


(            Matthew         Mark         Luke         John
 Matthew         0.0   911.938594   634.615632   966.104032
 Mark     911.938594          0.0  1223.143082   767.534364
 Luke     634.615632  1223.143082          0.0  1332.786555
 John     966.104032   767.534364  1332.786555          0.0,
 ('Matthew', 'Luke'),
 ('Luke', 'John'))

2\. Calculate the cosine distances between all pairs of gospels. What do you notice now? How does your conclusion compare to part 1?

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
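
As a starting point, here is a minimal sketch of cosine distances on raw term counts. It uses a made-up toy corpus rather than the gospels, and the names `toy_names`, `toy_corpus`, `toy_tf`, and `toy_cos` are illustrative assumptions, not part of the assignment. The second toy document is the first one repeated twice, which makes it easy to see how cosine distance treats documents that differ only in length.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Toy stand-in for the gospel texts (hypothetical data, not the real corpus).
toy_names = ["short", "long", "other"]
toy_corpus = [
    "the lamb and the shepherd",
    "the lamb and the shepherd the lamb and the shepherd",  # "short" repeated twice
    "a king and a kingdom",
]

# Term-frequency matrix: one row per document, one column per word.
toy_tf = CountVectorizer().fit_transform(toy_corpus)

# Pairwise cosine distances (1 - cosine similarity) between the rows.
toy_cos = pd.DataFrame(cosine_distances(toy_tf), index=toy_names, columns=toy_names)
print(toy_cos.round(3))
```

The same two calls should work on `corpus` with `gospel_names` as the labels. Comparing the "short"/"long" entry here with what Euclidean distance would give for the same pair is one way to think about what changes between part 1 and part 2.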


3\. Construct the TF-IDF matrix, and calculate the cosine distances between all pairs of gospels. What is your conclusion now?

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED

4\. Which of the three analyses/conclusions do you think is most appropriate? Why?

**YOUR RESPONSE HERE.**

You have just discovered a phenomenon known to critical Biblical scholars as the [Synoptic problem](https://en.wikipedia.org/wiki/Synoptic_Gospels)!

## Historical Documents

Five famous texts from American history are contained in the files listed below. Use the tools that we have covered to determine which two of these documents are most similar to each other. Specify how you determined this, including any choices you made along the way.

In [None]:
dir = "http://dlsun.github.io/stats112/data/texts/"
texts = [
    "declaration_of_independence.txt",
    "give_me_liberty.txt",
    "declaration_of_sentiments.txt",
    "gettysburg_address.txt",
    "i_have_a_dream.txt"]

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
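
Whichever vectorizer and distance you choose, the last step is reading the closest pair off a square distance matrix. Below is a small sketch of a helper for that; `closest_and_farthest` and the toy matrix are hypothetical names and values, not part of the course materials. It looks only at entries above the diagonal so that each pair is counted once and the zero self-distances are ignored.

```python
import numpy as np
import pandas as pd


def closest_and_farthest(dist):
    """Return the label pairs with the smallest and largest off-diagonal distance."""
    # Positions strictly above the diagonal: each unordered pair appears exactly once.
    rows, cols = np.triu_indices(len(dist), k=1)
    values = dist.to_numpy(dtype=float)[rows, cols]
    lo, hi = np.argmin(values), np.argmax(values)
    closest = (dist.index[rows[lo]], dist.columns[cols[lo]])
    farthest = (dist.index[rows[hi]], dist.columns[cols[hi]])
    return closest, farthest


# Hypothetical distance matrix for three documents.
labels = ["A", "B", "C"]
toy_dist = pd.DataFrame(
    [[0.0, 0.2, 0.9], [0.2, 0.0, 0.5], [0.9, 0.5, 0.0]],
    index=labels,
    columns=labels,
)
closest, farthest = closest_and_farthest(toy_dist)
print(closest, farthest)
```

The helper assumes a symmetric matrix whose rows and columns share the same labels, which is what pairwise distance functions produce when a corpus is compared with itself.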

## Authorship of the Federalist Papers

The _Federalist Papers_ were a set of 85 essays published between 1787 and 1788 to promote the ratification of the United States Constitution. They were originally published under the pseudonym "Publius". Although the identity of the authors was a closely guarded secret at the time, most of the papers have since been conclusively attributed to one of [Hamilton](https://www.youtube.com/watch?v=_YHVPNOHySk), Jay, or Madison. The known authorships can be found in `https://dlsun.github.io/pods/data/federalist/authorship.csv`.

For 15 of the papers, however, the authorship remains disputed. (These papers can be identified from the `authorship.csv` file because the "Author" field is blank.) In this analysis, you will use the papers with known authorship to predict the authorship of the disputed papers. The text of each paper is available at `https://dlsun.github.io/pods/data/federalist/x.txt`, where `x` is the number of the paper (i.e., a number from 1 to 85).

Here are the authors:

In [None]:
authors = pd.read_csv("https://dlsun.github.io/pods/data/federalist/authorship.csv", index_col = "Paper")

authors

Paper,Author
1,Hamilton
2,Jay
3,Jay
4,Jay
5,Jay
...,...
81,Hamilton
82,Hamilton
83,Hamilton
84,Hamilton


And here are the texts:

In [None]:
import requests

fed_dir = "https://dlsun.github.io/pods/data/federalist/"

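# Give the empty Series an explicit dtype so pandas does not have to guess one.
docs_fed = pd.Series(dtype=object)

for number in range(1, 86):
  file = fed_dir + "{}.txt".format(number)
  # requests.get takes only the URL here; a file mode such as "r" is not a valid argument.
  response = requests.get(file)
  docs_fed[str(number)] = response.text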

docs_fed


1     To the People of the State of New York:\n\nAFT...
2     To the People of the State of New York:\n\nWHE...
3     To the People of the State of New York:\n\nIT ...
4     To the People of the State of New York:\n\nMY ...
5     To the People of the State of New York:\n\nQUE...
                            ...                        
81    To the People of the State of New York:\n\nLET...
82    To the People of the State of New York:\n\nTHE...
83    To the People of the State of New York:\n\nTHE...
84    To the People of the State of New York:\n\nIN ...
85    To the People of the State of New York:\n\nACC...
Length: 85, dtype: object

Now we do some cleaning of the texts:

In [None]:
words = (
    docs_fed.
    str.lower().
    str.replace(r"[^\w\s]", " ", regex=True).
    str.split()
)

words


1     [to, the, people, of, the, state, of, new, yor...
2     [to, the, people, of, the, state, of, new, yor...
3     [to, the, people, of, the, state, of, new, yor...
4     [to, the, people, of, the, state, of, new, yor...
5     [to, the, people, of, the, state, of, new, yor...
                            ...                        
81    [to, the, people, of, the, state, of, new, yor...
82    [to, the, people, of, the, state, of, new, yor...
83    [to, the, people, of, the, state, of, new, yor...
84    [to, the, people, of, the, state, of, new, yor...
85    [to, the, people, of, the, state, of, new, yor...
Length: 85, dtype: object

In [None]:
from collections import Counter

words.apply(Counter)

1     {'to': 72, 'the': 133, 'people': 6, 'of': 106,...
2     {'to': 53, 'the': 107, 'people': 23, 'of': 83,...
3     {'to': 56, 'the': 93, 'people': 8, 'of': 62, '...
4     {'to': 51, 'the': 86, 'people': 8, 'of': 72, '...
5     {'to': 45, 'the': 66, 'people': 3, 'of': 53, '...
                            ...                        
81    {'to': 163, 'the': 389, 'people': 1, 'of': 248...
82    {'to': 83, 'the': 168, 'people': 1, 'of': 94, ...
83    {'to': 219, 'the': 485, 'people': 3, 'of': 331...
84    {'to': 140, 'the': 390, 'people': 11, 'of': 29...
85    {'to': 115, 'the': 246, 'people': 7, 'of': 172...
Length: 85, dtype: object

### Question 0

Recall that 15 of the papers have disputed authorship. How could you use the topics we have covered to "predict" the authors of these 15?

**Brainstorm some ideas before proceeding!**

**YOUR RESPONSE HERE.**

### Question 1

When analyzing an author's style, common words like "the" and "on" are actually more useful than rare words like "hostilities". That is because rare words typically signify context. Context is useful if you are trying to find documents about similar topics, but not so useful if you are trying to identify an author's style because different authors can write about the same topic. For example, both Dr. Seuss and Charles Dickens used rare words like "chimney" and "stockings" in _How the Grinch Stole Christmas_ and _A Christmas Carol_, respectively. But they used common words very differently: Dickens used the word "upon" over 100 times, while Dr. Seuss did not use "upon" even once.

Read in the Federalist Papers. Convert each one into a vector of term frequencies. In order to restrict to common words, include only the top 50 words across the corpus. (Because we are restricting to the most common words already, there is no reason to reweight them using TF-IDF, since the most common words will be in all the documents.)

In [None]:
tf = pd.DataFrame(list(words))
tf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5846,5847,5848,5849,5850,5851,5852,5853,5854,5855
0,to,the,people,of,the,state,of,new,york,after,...,,,,,,,,,,
1,to,the,people,of,the,state,of,new,york,when,...,,,,,,,,,,
2,to,the,people,of,the,state,of,new,york,it,...,,,,,,,,,,
3,to,the,people,of,the,state,of,new,york,my,...,,,,,,,,,,
4,to,the,people,of,the,state,of,new,york,queen,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80,to,the,people,of,the,state,of,new,york,let,...,,,,,,,,,,
81,to,the,people,of,the,state,of,new,york,the,...,,,,,,,,,,
82,to,the,people,of,the,state,of,new,york,the,...,being,vested,in,the,supreme,court,is,examined,and,refuted
83,to,the,people,of,the,state,of,new,york,in,...,,,,,,,,,,


### Question 2

Make a visualization that compares the most common words between Hamilton, Jay, and Madison.

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
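
A possible way to prepare the data for such a plot is to average each word's share of a paper within each author. The sketch below uses a tiny made-up table; `toy_tf`, `toy_authors`, `toy_props`, and `by_author` are hypothetical names and the numbers are not from the Federalist Papers.

```python
import pandas as pd

# Hypothetical term-frequency table: rows are papers, columns are common words.
toy_tf = pd.DataFrame(
    {"the": [10, 12, 8, 9], "upon": [4, 5, 0, 1], "of": [6, 7, 5, 6]},
    index=[1, 2, 3, 4],
)
# Hypothetical author labels on the same index as the table.
toy_authors = pd.Series(
    ["Hamilton", "Hamilton", "Madison", "Madison"],
    index=[1, 2, 3, 4],
    name="Author",
)

# Convert counts to within-paper proportions so long papers do not dominate.
toy_props = toy_tf.div(toy_tf.sum(axis=1), axis=0)

# Average proportion of each word for each author: one row per author.
by_author = toy_props.groupby(toy_authors).mean()
print(by_author.round(3))
```

From here, `by_author.T.plot.bar()` would draw one group of bars per word with one bar per author (this needs matplotlib installed). In this notebook, `docs_fed` is keyed by strings while `authors` appears to be indexed by integers, so the two indexes may need to be converted to the same type before grouping.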

### Question 3

Recall that 15 of the papers have disputed authorship. How could you use the topics we have covered to "predict" the authors of these 15?

**Brainstorm some ideas before proceeding!** (Just making sure you discuss this question before moving on to the next part.)

**YOUR RESPONSE HERE.**

### Question 4

For each of the documents with disputed authorships, find the 5 most similar documents with _known_ authorships, using cosine distance on the term frequencies. Use the authors of these 5 most similar documents to predict the author of each disputed document.

For example, if 3 of the 5 closest documents were written by Hamilton, 1 by Madison, and 1 by Jay, then we would predict that the disputed document was written by Hamilton.

In [None]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
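
The voting procedure described above can be sketched as follows, using made-up term frequencies. All table contents and the names `known_tf`, `known_authors`, `disputed_tf`, `dist`, and `predictions` are assumptions for illustration. The toy data has six "known" papers so that the five nearest neighbors always contain a majority.

```python
from collections import Counter

import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical term-frequency rows for papers with known authors.
known_tf = pd.DataFrame(
    [[9, 1, 5], [8, 2, 5], [9, 2, 4], [3, 8, 5], [2, 9, 4], [3, 9, 6]],
    index=["k1", "k2", "k3", "k4", "k5", "k6"],
    columns=["the", "upon", "of"],
)
known_authors = pd.Series(
    ["Madison", "Madison", "Madison", "Hamilton", "Hamilton", "Hamilton"],
    index=known_tf.index,
)
# Hypothetical disputed papers over the same columns.
disputed_tf = pd.DataFrame(
    [[8, 1, 4], [2, 8, 5]], index=["d1", "d2"], columns=known_tf.columns
)

k = 5
# Rows are disputed papers, columns are known papers.
dist = pd.DataFrame(
    cosine_distances(disputed_tf, known_tf),
    index=disputed_tf.index,
    columns=known_tf.index,
)

predictions = {}
for paper, row in dist.iterrows():
    nearest = row.nsmallest(k).index         # labels of the k closest known papers
    votes = Counter(known_authors[nearest])  # author -> number of votes
    predictions[paper] = votes.most_common(1)[0][0]

print(predictions)
```

With three possible authors and five votes, ties such as 2-2-1 can occur; `most_common` then returns whichever tied author it encountered first, so a real analysis may want an explicit tie-breaking rule.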