# Mini-project part III: Research report

We've put some effort into building our collection of volumes - you can find them in [this Google Drive folder](https://drive.google.com/drive/u/0/folders/1UAaGIiqElF9YLTGIy6hQYM7QcGzosZgR). Now it's time to learn something about it. You already have lots of excellent ideas for how to apply the tools we've learned about so far. It's also a good time in the semester to review what we have learned and practice applying it in less structured settings.

**You will work by yourself or in a group of up to three people** to complete a short project applying methods from the previous weeks to this collection. You will turn in the completed project as a single notebook (one submission per group) with the following sections:

0. **Project team.** List members with full names and NetIDs. If your group does not contain at least one native speaker of English, let us know.

1. **Question(s) (10 points).** Describe what you wanted to learn. Suggest several possible answers or hypotheses, and describe in general terms what you might expect to see if each of these answers were true (save specific measurements for the next section). For example, many students want to know the difference between horror and non-horror fiction, or between detective stories and horror fiction, but there are many ways to operationalize this question. You do not need to limit yourself to questions of genre. **Note that your question should be interesting! If the answer is obvious before you begin, or if it's something the importance of which you cannot explain, your grade will suffer (a lot).** 

1. **Methods (10 points).** Describe how you will use computational methods presented so far in this class to answer your question. What do the computational tools do, and how does their output relate to your question? Describe how you will process the collection into a form suitable for a model or algorithm and why you have processed it the way you have.

1. **Code (20 points).** Carry out your experiments. Code should be correct (no errors) and focused (unneeded code from examples is removed). Use the notebook format effectively: code may be incorporated into multiple sections.

1. **Results and discussion (40 points).** Use sorted lists, tables, and visual presentations to make your argument. Excellent projects will provide multiple views of results, and follow up on any apparent outliers or strange cases, including through careful reading of the original documents.

1. **Reflection (10 points).** Describe your experience in this process. What was harder or easier than you expected? What compromises or negotiations did you have to accept to match the collection, the question, and the methods? What would you try next? Your reflections should be written as a single narrative that incorporates the viewpoints of all group members.

1. **Resources consulted (0 points, but -5 if missing).** Credit any online sources (Stack Overflow, blog posts, documentation, etc.) that you found helpful.

1. **Responsibility statement (0 points, but -5 if missing).** See separate CMS assignment "MP 03: Responsibility statement". **Note:** If you worked alone on the project (a group of one), you are not required to submit the responsibility statement.


## Submission instructions

1. Complete this Jupyter Notebook.
1. Open a group submission on CMS - Let us know if you encounter any problems with that!
1. Assign the role of group submitter to a member of your group.
1. Submit the completed Jupyter Notebook in the group submission on CMS.

**Note:** If you worked alone on the project (a group of one), you are not required to submit the responsibility statement.

Otherwise...

1. Go to the CMS assignment 'MP 03: Responsibility statement'.
1. Complete the statement, describing each group member's contributions as you see them. 
1. Upload the statement as an **individual** submission on CMS.


## Guidance and advice

Show us what you've learned so far. Try to use a range of methods while remaining focused on your chosen problem. For inspiration, consider the range of research problems you've encountered in the readings. Note that dictionary-based sentiment scoring projects have not historically done well without substantial additional methodological diversity.

**We will grade this work based on accuracy, thoroughness, creativity, reflectiveness, and quality of presentation.** Code and results that are merely correct are generally not enough, on their own, to achieve a high score.

**Scope:** this is a *mini*-project, with a short deadline. We are expecting work that is consistent with that timeframe, but that is serious, thoughtful, and rigorous. This assignment will almost certainly require more time and effort than the typical weekly homework. **For group work, the expected scope grows linearly with the number of participants.**

# 0. Project team

Estelle Hooper (ehh52), Gabriella Chu (gc386)

# 1. Question(s) (10 points)

### Research Question: How can we recommend books of other genres to lovers of a single genre?
#### Motivation 
When choosing any form of entertainment, such as a TV show, play, or movie, people tend to favor a particular genre or gravitate towards a favorite story. Unlike those forms, books and novels are consumed at the reader's own pace, so staying committed to a written story can cost more time and mental effort. Because time and attention are scarce, people are more likely to read books of one genre. Many readers do not have the capacity to branch out and try new genres; they feel that the safer choice is a more predictable book of their favored genre rather than the risk of not enjoying a different one.
#### Hypothesis
For a reader to have confidence in another genre, there must be overlapping qualities between the reader's favorite genre and the new genre. Fans of the horror genre may be fond of its settings, because horror fiction often involves supernatural beings or powers, such as ghosts. They may therefore also enjoy science fiction or fantasy that similarly incorporates elements outside of the natural world. Besides content, genres may overlap in how their authors tend to portray characters. Adventure and detective stories tend to involve a partner-in-crime relationship in which the two characters develop strong trust. If character interactions are important, then perhaps fans of the detective genre would also be fans of the romance genre, since the latter involves the development of a relationship between characters.
#### Project
Our corpus consists of a good sample of horror novels, detective novels, and a miscellaneous group of novels from other genres. We want to see how close these "other" novels are to either the horror genre or the detective genre. We will train a classifier using the detective and horror novels, and run that classifier on the "other" novels. Those "other" novels will receive a horror or detective label (despite not being canonically considered part of the horror or detective genres). With these classifier outcomes, we will recommend the "other" books labeled "detective" to fans of the detective genre and those labeled "horror" to fans of the horror genre. For example, if the classifier ran on *Harry Potter*, a fantasy series, and gave it a detective label, we would recommend that detective fans read *Harry Potter*.
#### Expectations
Overall, we don't expect the recommendations we obtain from the classifier to be very strong, because it is trained only on detective and horror novels, and it will be hard to find words that those two genres share with the various other genres in the corpus. We expect the genres whose subplots and settings most often appear in detective or horror novels to be matched most readily to those labels. For example, if detective novels tend to have romance subplots, we expect the classifier to label romance novels as detective.

# 2. Methods (10 points)

### Data Preparation
From the class corpus, we will divide the novels into two subsets: the horror and detective novels, and all remaining novels. The horror and detective novels will act as our "training" dataset, because we will train the classifier to classify text as either **detective (==1) or horror (==0).** The novels of all other genres will act as our "testing" dataset, and the classifier will predict each of them as either horror or detective. We will use the detective column in the metadata to get the gold labels of the training data in order to score our model later on. We'll also read/open all the novels in this step.
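
The label and file-reading steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final code: the toy metadata rows, the temporary folder, the UTF-8 encoding, and the helper name `read_texts` are all hypothetical stand-ins for the class corpus.

```python
import os
import tempfile

import pandas as pd


def read_texts(filenames, corpus_dir):
    """Return the full text of each file, in the order the filenames are given."""
    texts = []
    for name in filenames:
        # Assumption: the files are UTF-8 encoded.
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as handle:
            texts.append(handle.read())
    return texts


# Toy stand-in for the class metadata (hypothetical rows, not the real corpus).
toy_metadata = pd.DataFrame({
    "title": ["Toy Horror", "Toy Detective", "Toy Romance"],
    "horror": [True, False, False],
    "detective": [False, True, False],
    "filename": ["toy_horror.txt", "toy_detective.txt", "toy_romance.txt"],
})

# Write toy files to a temporary folder so the sketch runs anywhere.
corpus_dir = tempfile.mkdtemp()
for name in toy_metadata["filename"]:
    with open(os.path.join(corpus_dir, name), "w", encoding="utf-8") as handle:
        handle.write(f"Text of {name}")

is_horror = toy_metadata["horror"] == True
is_detective = toy_metadata["detective"] == True
not_horror = toy_metadata["horror"] == False
not_detective = toy_metadata["detective"] == False

# Exactly one of the two genres -> "training"; neither genre -> "testing".
toy_training = toy_metadata[is_horror ^ is_detective].sort_values("title")
toy_testing = toy_metadata[not_horror & not_detective].sort_values("title")

# Gold labels follow the convention in the text: detective == 1, horror == 0.
y_train = toy_training["detective"].astype(int).to_numpy()
train_texts = read_texts(toy_training["filename"], corpus_dir)
test_texts = read_texts(toy_testing["filename"], corpus_dir)
```

Because both subsets are sorted by title before reading, the order of `train_texts` matches the order of `y_train`, which is what the later scoring steps rely on.
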
### Vectorization
We will vectorize the novels with TF-IDF weighting, L2 normalization, and z-scores to standardize and normalize our corpus, and consider pre-processing factors like stopwords and lowercasing. Then, we'll create multiple vectorizers with a variety of maximum numbers of input features and fit them to the training data (true horror and detective). We will try different combinations of these matrices (each with a different number of input features) with different types of classifiers in the next step.
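
One way the vectorization step could look is sketched below. The toy sentences, the feature sizes `(5, 10, None)`, and the helper name `build_matrices` are assumptions made for illustration; the real notebook would pass the full novel texts and choose its own sizes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy "novels" standing in for the training texts.
toy_train_texts = [
    "the detective examined the clue and questioned the suspect",
    "the inspector traced a clue to the body",
    "a ghost haunted the dark house at night",
    "the monster crept through the dark night",
]


def build_matrices(texts, feature_sizes=(5, 10, None)):
    """Fit one TF-IDF vectorizer per feature size.

    Returns {size: (vectorizer, tfidf_matrix, z_scored_matrix)}.
    """
    matrices = {}
    for size in feature_sizes:
        vectorizer = TfidfVectorizer(
            lowercase=True,        # fold case before counting
            stop_words="english",  # drop common function words
            norm="l2",             # L2-normalize each document vector
            use_idf=True,
            max_features=size,     # keep only the most frequent terms
        )
        X = vectorizer.fit_transform(texts)
        # z-score each feature column; this needs a dense array
        X_z = StandardScaler().fit_transform(X.toarray())
        matrices[size] = (vectorizer, X, X_z)
    return matrices


matrices = build_matrices(toy_train_texts)
```

Note that z-scored columns contain negative values, which `MultinomialNB` does not accept, so in this sketch the z-scored matrix would only be suitable for a classifier such as logistic regression.
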
### Classifier
We will test two classification methods, Multinomial Naive Bayes and Logistic Regression, with different parameters, using X_train matrices of different sizes together with the gold labels (y_train). We will then compare the accuracy, precision, recall, and F1 scores and select the best classifier.
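
The comparison could be organized roughly as in the sketch below. It is illustrative only: the toy texts, the feature sizes, and the choice of 3 folds are assumptions, and wrapping the vectorizer in a pipeline (so it is refit inside each fold) is a variation on the plan above rather than the plan itself.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six "detective" and six "horror" snippets.
toy_texts = [
    "detective clue suspect alibi",
    "inspector clue evidence suspect",
    "detective evidence alibi witness",
    "inspector witness clue alibi",
    "detective suspect witness evidence",
    "inspector alibi suspect clue",
    "ghost night scream haunted",
    "monster night dark scream",
    "ghost dark haunted vampire",
    "monster vampire night haunted",
    "ghost scream vampire dark",
    "monster haunted scream night",
]
toy_labels = [1] * 6 + [0] * 6  # detective == 1, horror == 0

metrics = ["accuracy", "precision", "recall", "f1"]
records = []
for size in (5, None):  # assumed feature sizes, for illustration
    candidates = {
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    for name, clf in candidates.items():
        pipeline = make_pipeline(TfidfVectorizer(max_features=size), clf)
        scores = cross_validate(pipeline, toy_texts, toy_labels, cv=3, scoring=metrics)
        record = {"classifier": name, "max_features": size}
        for metric in metrics:
            record[metric] = scores[f"test_{metric}"].mean()
        records.append(record)

# One row per (classifier, feature size), sorted so the best F1 comes first.
summary = pd.DataFrame(records).sort_values("f1", ascending=False)
```

Sorting the summary by F1 makes the selection step explicit, and keeping all four metrics in one table makes it easy to see when a high F1 hides weak precision or recall.
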
### Feature Selection 
After finding the model with the best F1 score, we will examine the words/tokens/features with the heaviest weights that aid in classifying a work as detective or horror. If we see odd words, such as character names, or common stopwords as the top features, we will reconsider our vectorization steps.
### Testing
After finalizing our classifier, we will run it on our testing data, that is, the novels of other genres without detective or horror labels. From this classifier, each "other" novel will receive a horror or detective label (y_test).
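
A minimal sketch of this step, assuming a logistic regression was selected (the toy texts and titles are hypothetical). The important detail is that the "other" novels are transformed with the vectorizer already fitted on the training data, not refitted:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training snippets: detective == 1, horror == 0.
train_texts = [
    "detective clue suspect alibi",
    "inspector clue evidence suspect",
    "detective evidence alibi witness",
    "ghost night scream haunted",
    "monster night dark scream",
    "ghost dark haunted vampire",
]
y_train = np.array([1, 1, 1, 0, 0, 0])

# Hypothetical "other" novels with no gold label.
other_titles = ["Other A", "Other B"]
other_texts = ["clue suspect evidence wedding", "night scream dark voyage"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_other = vectorizer.transform(other_texts)  # reuse the training vocabulary

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_test = model.predict(X_other)

# Column 1 of predict_proba corresponds to class 1 (detective).
detective_prob = model.predict_proba(X_other)[:, 1]

predictions = pd.DataFrame({
    "title": other_titles,
    "label": np.where(y_test == 1, "detective", "horror"),
    "detective_prob": detective_prob,
}).sort_values("detective_prob", ascending=False)
```

Keeping the predicted probability next to the binary label gives a simple ranking of how strongly each novel leans toward either genre.
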
### Scatterplots/Visualizations
Because the classifier only makes binary decisions, we want to examine the most "detective-y" and "horror-y" novels, since there will be some novels for which the classifier could not make as definitive a decision. We will use SVD to graph all the novels in 2D space. We will then create multiple scatterplots to look at these decision boundaries. The most important scatterplots will be the ones that show how close the predicted novels are to the true labeled novels in this space. We will consider the predicted dots that are closest to the true dots to be the best recommendations for that genre.
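
The distance-based ranking could be sketched as below. Everything here is an assumption for illustration: the toy texts, the choice to fit the SVD on the training novels only, and the use of the nearest true-labeled novel (rather than, say, a genre centroid) as the distance measure. The helper name `rank_by_distance` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy training snippets: detective == 1, horror == 0.
train_texts = [
    "detective clue suspect alibi",
    "inspector clue evidence suspect",
    "detective evidence alibi witness",
    "ghost night scream haunted",
    "monster night dark scream",
    "ghost dark haunted vampire",
]
y_train = np.array([1, 1, 1, 0, 0, 0])
other_titles = ["Other A", "Other B"]
other_texts = ["clue suspect evidence wedding", "night scream dark voyage"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_other = vectorizer.transform(other_texts)

# Project every novel into the same 2D space.
svd = TruncatedSVD(n_components=2, random_state=0)
train_2d = svd.fit_transform(X_train)
other_2d = svd.transform(X_other)


def rank_by_distance(other_2d, titles, train_2d, y_train, label):
    """Rank "other" novels by distance to the nearest true novel of one genre."""
    genre_points = train_2d[y_train == label]
    # distances has shape (n_other, n_genre_points)
    distances = euclidean_distances(other_2d, genre_points)
    nearest = distances.min(axis=1)
    order = np.argsort(nearest)
    return [(titles[i], float(nearest[i])) for i in order]


detective_ranking = rank_by_distance(other_2d, other_titles, train_2d, y_train, label=1)
horror_ranking = rank_by_distance(other_2d, other_titles, train_2d, y_train, label=0)
```

The two columns of `train_2d` and `other_2d` are the x and y coordinates that would go into the scatterplots, with color or marker style distinguishing true from predicted labels.
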

# 3. Code (20 points)

In [1]:
# Imports (all of them!)
%matplotlib inline
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

## Data Preparation
From class corpus metadata, get training books, testing books, and gold labels for training books.
- training data = all books that are detective==True **or** horror==True. Drop all books that have both genres as true
- testing data = all other books

In [2]:
# class corpus metadata
metadata = pd.read_csv("class_corpus_metadata.csv")
metadata.shape

(160, 34)

In [3]:
metadata.head()

| Unnamed: 0.1 | Unnamed: 0 | check_1 | check_2 | title | year | author1_surname | author1_givenname | author2_surname | author2_givenname | gender_author1 | ... | feminist fiction | mystery | adventure | tragedy | children | regency | manners | philosophical | coming-of-age | filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nsg57 | scw222 | lcc82 | Writings in the United Amateur, 1915 - 1922 | 1922 | Lovecraft | Howard | | | Male | ... | False | True | False | False | False | False | False | True | False | Lovecraft_WritingsintheUnitedAmateur1915-1922.txt |
| 1 | fhh26 | gs542 | tj256 | Whose Body? | 1923 | Sayers | Dorothy L. | | | Female | ... | False | True | False | False | False | False | False | False | False | Sayres_WhoseBody.txt |
| 2 | cl2264 | | | Voodoo Planet | 1959 | Norton | Andre | | | Female | ... | False | False | True | False | False | False | False | False | False | Norton_VoodooPlanet.txt |
| 3 | ehh52 | sjr255 | kg428 | Varney the Vampire; Or, the Feast of Blood by ... | 1845 | Rymer | James Malcolm | Prest | Thomas Peckett | Male | ... | False | False | False | False | False | False | False | False | False | Prest_Rhymer_VarneyTheVampire.txt |
| 4 | dgr73 | jlp367 | kg428 | Uncle Tom's Cabin | 1852 | Stowe | Harriet Beecher | | | Female | ... | False | False | False | False | False | False | False | False | False | Stowe_UncleTom_sCabin.txt |


In [4]:
# training data are books that are either horror or detective
training_data = metadata[(metadata['horror']==True) | (metadata['detective']==True)]

# drop books that are both horror and detective
drop = metadata[(metadata['horror']==True) & (metadata['detective']==True)]
training_data = training_data.drop(drop.index)

# testing data are books that are neither horror nor detective
testing_data = metadata[(metadata['horror']==False) & (metadata['detective']==False)]

# sort titles alphabetically 
training_data = training_data.sort_values('title')
testing_data = testing_data.sort_values('title')
# note: training + testing + dropped row = 159 rows, but the class corpus has 160 rows;
# "An Unkindness of Ghosts" has no value in the horror and detective columns
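
The row-count note above can be checked mechanically. The sketch below uses a hypothetical five-row toy table rather than the class metadata; it shows that a row with missing genre values fails both the `== True` and the `== False` comparisons, so it lands in neither subset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy metadata; row "E" has no value in either genre column.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "horror": [True, False, True, False, np.nan],
    "detective": [False, True, True, False, np.nan],
})

is_horror = toy["horror"] == True
is_detective = toy["detective"] == True
not_horror = toy["horror"] == False
not_detective = toy["detective"] == False

toy_training = toy[is_horror ^ is_detective]   # exactly one genre
toy_both = toy[is_horror & is_detective]       # dropped: both genres
toy_testing = toy[not_horror & not_detective]  # neither genre

# Any row not in one of the three groups was silently left out.
accounted = toy_training.index.union(toy_both.index).union(toy_testing.index)
unassigned = toy.drop(accounted)
```

Printing `unassigned` on the real metadata would list any titles that were left out of both subsets, which makes it possible to decide deliberately whether to label or exclude them.
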

# 4. Results and discussion (40 points)

In [None]:
# few paragraphs
# realistic - 3 results
# figure, table, accuracy table
# analyze each of the results, one paragraph at the end how they fit together
# group -- a bit more

# 5. Reflection (10 points)

# 6. Resources consulted (0 points, but -5 if missing)

https://stackoverflow.com/questions/12618030/how-to-replace-back-slash-character-with-empty-string-in-python


# 7. Responsibility statement (0 points, but -5 if missing)
**See separate CMS assignment 'MP 03: Responsibility statement'.**