# Info 2950 Final Project: A Data Driven Understanding of Marriage
#### Phase 2 (10/20/22)
#### by: Evan Schnell

#### Single Idea: How has the perception of marriage evolved over time? (The research questions I use are listed below)

### Data Format:

Each row of the enclosed dataframe consists of:
- A link to a book from Project Gutenberg on marriage encoded as a UTF-8 text file
- The title of the book
- The author's name
- An estimated year of publication (or the midpoint of the author's life)
- The author's gender
- Word counters for the categories: 
    - 1) anger
    - 2) anticipation
    - 3) disgust
    - 4) fear
    - 5) joy
    - 6) negative
    - 7) positive
    - 8) sadness
    - 9) surprise
    - 10) trust
- The sum of all the word counters
 
### Research Questions

Goal: My goal is to understand how the perception of marriage has evolved over time. Because my data is incomplete due to financial constraints, I aim to determine whether an analysis along these lines could motivate a properly funded study. Here is an outline of the analyses that I plan to perform in the final draft of my assignment:

Q1) Are the categories provided in the lexicon (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust) generally independent or do they interact with each other? Understanding the answer to this question will help with normalizing my data and it will allow for better analysis in future steps.

Q2) How has the perception of marriage changed over time?

Q3) What is the effect of the author’s gender on his or her perception of marriage?
 
### Plan for Answering the Research Questions 

##### Question 1
- I will determine how the categories (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust) are related to each other by identifying the principal components of the data using Principal Component Analysis (PCA), a common unsupervised dimensionality reduction algorithm. I may perform this analysis separately on the data gathered from books written by males and females.
- Principal components are indices that are linear combinations of the original data categories. Understanding how the principal components are created for my data set will help me identify how variables are correlated within a higher dimensional data set. I hope to compare the results I obtain from my principal axes with results from a covariance matrix as a sanity check.
- Dimensionality reduction techniques allow the user to account for more variables at the cost of losing some of the variance enclosed within the data. I hope to leverage PCA, a quantitative dimensionality reduction technique, to determine what portion of the variance can be attributed to each principal axis and identify an optimal tradeoff between retained variance and the power of my tests.
- Understanding how the variables interact may help me normalize my data. I think comparing data normalized using the results of my PCA versus data normalized using the sum of all word counters could be interesting.
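
A minimal sketch of the PCA step described above, assuming NumPy is available. The data here is synthetic stand-in data, not the real book counts, and the function name `pca` and the choice to standardize each category first are illustrative assumptions. The last lines show the planned sanity check: the variance shares from PCA should match the eigenvalues of the correlation matrix.

```python
import numpy as np

CATEGORIES = ["anger", "anticipation", "disgust", "fear", "joy",
              "negative", "positive", "sadness", "surprise", "trust"]

def pca(counts):
    """Return (explained_variance_ratio, principal_axes) for a books x categories array."""
    X = np.asarray(counts, dtype=float)
    # Standardize each column so high-count categories do not dominate (an assumption).
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal axes; squared singular values are proportional
    # to the variance captured by each axis.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return explained, Vt

# Synthetic stand-in data (NOT the real dataframe): 178 books x 10 categories.
rng = np.random.default_rng(0)
fake_counts = rng.poisson(lam=50, size=(178, len(CATEGORIES)))

explained, components = pca(fake_counts)
print(np.round(np.cumsum(explained), 3))  # cumulative share of variance retained

# Sanity check against the correlation matrix (covariance of standardized data).
corr_eigvals = np.linalg.eigvalsh(np.corrcoef(fake_counts, rowvar=False))[::-1]
print(np.allclose(explained, corr_eigvals / corr_eigvals.sum()))
```

Each row of `components` lists the weight of every category on one principal axis, so large weights of the same sign on two categories suggest that those categories move together.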
 
##### Question 2
- I plan to run regressions on each of the emotional scores versus time. Based on the results of the principal axes I identify using PCA, the regressions I run in this step will likely change. 
- I plan to run data from male and female authors separately.
- How I attempt to answer this question will depend on the results I get from part 1.
- I will add more to this section once I know more.
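
As a rough illustration of regressing one emotional score on time, here is a sketch of a simple least-squares line fit. The helper name `fit_line` and the numbers are made up for the example; the regressions actually run may differ depending on the PCA results.

```python
def fit_line(years, shares):
    """Ordinary least squares fit of share = intercept + slope * year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(shares) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, shares))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values for illustration only (not taken from the data set).
years = [1800, 1850, 1900, 1950]
joy_share = [0.10, 0.12, 0.14, 0.16]

slope, intercept = fit_line(years, joy_share)
print(slope, intercept)
```

A positive slope would mean the category's share of emotion words rises with publication year; running the same fit separately for male and female authors follows the plan above.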

##### Question 3
- How I attempt to answer this question will depend on the results I get from parts 1 and 2. 
- As of right now, I plan to run logistic regressions similar to the ones we ran in homework 4.
- I will add more to this section once I know more.

### My Exploratory Data Analysis
I plan on performing the analysis written above. We have not learned dimensionality reduction yet, so I will perform a better analysis once I have time to figure this out. For now, I will perform my exploratory data analysis by normalizing using the sum of category scores.
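
Normalizing by the sum of category scores can be sketched as follows. The function name `normalize_counts` and the example counts are hypothetical; the idea is only that each counter is divided by the total so that long and short books become comparable.

```python
def normalize_counts(counts):
    """Convert raw category counts into shares of the total count."""
    total = sum(counts.values())
    if total == 0:
        # A book with no lexicon words gets all-zero shares.
        return {category: 0.0 for category in counts}
    return {category: value / total for category, value in counts.items()}

# Hypothetical counts for one book, using only three categories for brevity.
example = {"anger": 20, "joy": 30, "trust": 50}
shares = normalize_counts(example)
print(shares)
```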

The explanations for my exploratory data analysis are provided with the code below.

### Data Source
I obtained all of my data (on 178 books) from Project Gutenberg. The search link is given at the end of this section.

To see the books I chose from, go to advanced search and select: 
- Subject: marriage
- Language: English
- Filetype: Plain Text

After clicking on a book, you get the option to open the text as a variety of filetypes. I chose UTF-8 text files.

I excluded all duplicates. Where possible, I kept only the first book in a series, or the whole series as one text file, to keep any one series from having too much of an effect on my data set. This reduced the number of viable books from 265 to 178.

https://www.gutenberg.org/ebooks/results/?author=&title=&subject=marriage&lang=en&category=&locc=&filetype=txt&submit_search=Search&pageno=1
 
### Lexicon Source
The NRC Emotion Lexicon is self-described as a "list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive)." This list can be used for free for educational and research purposes.

Here is the link to the website: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm


I chose this list because it is extremely comprehensive and contains scores for over 14,000 words. It is also a reputable source that has been used in several academic papers.

I downloaded the lexicon as a CSV and modified it to obtain a reformatted lexicon that reduces my runtime. The process of reformatting the lexicon is described below.


### How I Reformatted the Lexicon
The CSV file contained over 140,000 rows containing:
- the word of interest
- the category
- a 0 or 1

Each word appeared in 10 consecutive rows (one per category), so I reformatted the lexicon to store one row per word:
- the word of interest
- 10 numbers (each a 0 or 1), one for each category

I did not add a reformatted row if all 10 numbers were 0.
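
The reformatting described above can be sketched like this. It assumes the rows have already been read into `(word, category, flag)` tuples; the function name `reformat_lexicon` and the sample rows are illustrative and are not taken from the notebook.

```python
from collections import defaultdict

CATEGORIES = ["anger", "anticipation", "disgust", "fear", "joy",
              "negative", "positive", "sadness", "surprise", "trust"]

def reformat_lexicon(rows):
    """Collapse (word, category, flag) rows into {word: [10 flags]}, dropping all-zero words."""
    index = {category: i for i, category in enumerate(CATEGORIES)}
    flags = defaultdict(lambda: [0] * len(CATEGORIES))
    for word, category, flag in rows:
        flags[word][index[category]] = int(flag)
    # Keep only words associated with at least one category.
    return {word: f for word, f in flags.items() if any(f)}

# Tiny illustrative input; a real word would have one row per category.
sample_rows = [
    ("abandon", "fear", "1"),
    ("abandon", "joy", "0"),
    ("aback", "anger", "0"),
]
lexicon = reformat_lexicon(sample_rows)
print(lexicon)
```

Storing the lexicon as a dictionary keyed by word makes each lookup a single hash access, which is one way the reformatted lexicon can reduce runtime.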

### How I Cleaned the Text Data

I generated each row in my dataframe by adding 1 to each category attribute when a word with that category attribute appeared in the text.
- I removed undesirable special characters and split the text into words.
- I removed unnecessary whitespace and removed common words without significant meaning to reduce runtime.
- I compared the remaining words with the reformatted lexicon and, whenever a word appeared in the lexicon, added its scores in the 10 categories to the running totals.
- I realized that counting the number of occurrences of each word first, and then comparing the unique words with the lexicon, would decrease the runtime, but I noticed this too late to make the optimization.
- A friend who took data mining said that, had I implemented the optimization in the bullet above, my process would be similar to vectorization.
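
Here is a sketch of the scoring steps above, including the word-counting optimization that was not implemented in the project. The stop-word list, the regular-expression tokenizer, and the two-word lexicon are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop words only; the project's actual list is not shown here.
STOPWORDS = {"the", "a", "and", "of", "to"}

def score_text(text, lexicon, n_categories=10):
    """Return the 10 category totals for one text."""
    # Lowercase and keep runs of letters, which drops punctuation and whitespace.
    words = re.findall(r"[a-z]+", text.lower())
    # Count each distinct word once so the lexicon is consulted once per word.
    counts = Counter(w for w in words if w not in STOPWORDS)
    totals = [0] * n_categories
    for word, n in counts.items():
        flags = lexicon.get(word)
        if flags:
            for i, flag in enumerate(flags):
                totals[i] += flag * n
    return totals

# Hypothetical two-word lexicon in the reformatted layout (10 flags per word).
toy_lexicon = {
    "love": [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],  # joy, positive
    "fear": [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],  # fear, negative
}
totals = score_text("Love and fear; love, the end.", toy_lexicon)
print(totals)
```

Multiplying each flag by the word's count gives the same totals as adding 1 per occurrence, but with one lexicon lookup per distinct word instead of one per word in the text.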

### Limitations
This study is extremely limited because:
- It is based on a collection of 178 books that I picked because they were free to access. This means that the books may not be representative of the time period they were published in, because I used the available data to generate more volume instead of just picking the most representative books. (If I had money to spend on this project, I would pick 20 representative books for each decade and use these as my data set.)
- Some periods have very few data points, while many of the dates of authorship fall between 1800 and 1925. There is much less data from 1500 to 1799, meaning that there are significant gaps in my data.
- The selected books were picked to be restored by volunteers in the 2010s and 2020s. This means that books that appeal to our current values may be selected for (the volunteers pick the books they like).
- It relies on the accuracy of a lexicon that only scores words as having 0's or 1's per each category
- It does not consider sarcasm, context, or the tendency for the meaning of words to change over time

These limitations mean that we should not take the results of this study too seriously: the data has significant gaps, and I did not have sufficient access to books to create a representative data set. Instead, this study should be seen as a precursor study that can be used to identify potential trends that could be investigated more thoroughly by a more extensive, better funded study. Nevertheless, the algorithms for collecting the data set and the analysis I plan to use could be applied to a better data set and might lead to significant discoveries.

Basing a decision about whether or not to get married on the results of this study could reduce your quality of life. It is better to make these decisions for yourself while asking family members and professionals for advice. This analysis is meant to be purely academic and, given the quality of the initial data set, it should not be taken too seriously.

### File Descriptions (I will rename the files to make them more understandable later on)

##### Generating a cleaned lexicon
- NRC-Emotion-Lexicon-Wordlevel-v0.92: This is the original lexicon I downloaded (csv)
- Cleaned Lexicon: I clean the lexicon in this script (ipynb)
- new_lexicon_no_zeros: This is the cleaned lexicon I generate (txt)

##### Numeric scores
- DataGen: I generate the quantitative 10 category scores in this script (ipynb)
- 2950 Data: This is the final text document 

##### Generating the dataframe I use
- Final Dataframe: I generate the final dataframe that I will use in my analysis (ipynb)