# Mini-project part III: Research report

We've put some effort into building our collection of volumes - you can find them in [this Google Drive folder](https://drive.google.com/drive/u/0/folders/1UAaGIiqElF9YLTGIy6hQYM7QcGzosZgR). Now it's time to learn something about it. You already have lots of excellent ideas for how to apply the tools we've learned about so far. It's also a good time in the semester to review what we have learned and practice applying it in less structured settings.

**You will work by yourself or in a group of up to three people** to complete a short project applying methods from the previous weeks to this collection. You will turn in the completed project as a single notebook (one submission per group) with the following sections:

0. **Project team.** List members with full names and NetIDs. If your group does not contain at least one native speaker of English, let us know.

1. **Question(s) (10 points).** Describe what you wanted to learn. Suggest several possible answers or hypotheses, and describe in general terms what you might expect to see if each of these answers were true (save specific measurements for the next section). For example, many students want to know the difference between horror and non-horror fiction, or between detective stories and horror fiction, but there are many ways to operationalize this question. You do not need to limit yourself to questions of genre. **Note that your question should be interesting! If the answer is obvious before you begin, or if it's something the importance of which you cannot explain, your grade will suffer (a lot).** 

1. **Methods (10 points).** Describe how you will use computational methods presented so far in this class to answer your question. What do the computational tools do, and how does their output relate to your question? Describe how you will process the collection into a form suitable for a model or algorithm and why you have processed it the way you have.

1. **Code (20 points).** Carry out your experiments. Code should be correct (no errors) and focused (unneeded code from examples is removed). Use the notebook format effectively: code may be incorporated into multiple sections.

1. **Results and discussion (40 points).** Use sorted lists, tables, and visual presentations to make your argument. Excellent projects will provide multiple views of results, and follow up on any apparent outliers or strange cases, including through careful reading of the original documents.

1. **Reflection (10 points).** Describe your experience in this process. What was harder or easier than you expected? What compromises or negotiations did you have to accept to match the collection, the question, and the methods? What would you try next? Your reflections should be written as a single narrative that incorporates the viewpoints of all group members.

1. **Resources consulted (0 points, but -5 if missing).** Credit any online sources (Stack Overflow, blog posts, documentation, etc.) that you found helpful.

1. **Responsibility statement (0 points, but -5 if missing).** See separate CMS assignment "MP 03: Responsibility statement". **Note:** If you worked alone on the project (a group of one), you are not required to submit the responsibility statement.


## Submission instructions

1. Complete this Jupyter Notebook.
1. Open a group submission on CMS - Let us know if you encounter any problems with that!
1. Assign the role of group submitter to a member of your group.
1. Submit the completed Jupyter Notebook in the group submission on CMS.

**Note:** If you worked alone on the project (a group of one), you are not required to submit the responsibility statement.

Otherwise...

1. Go to the CMS assignment 'MP 03: Responsibility statement'.
1. Complete the statement, describing each group member's contributions as you see them. 
1. Upload the statement as an **individual** submission on CMS.


## Guidance and advice

Show us what you've learned so far. Try to use a range of methods while remaining focused on your chosen problem. For inspiration, consider the range of research problems you've encountered in the readings. Note that dictionary-based sentiment scoring projects have not historically done well without substantial additional methodological diversity.

**We will grade this work based on accuracy, thoroughness, creativity, reflectiveness, and quality of presentation.** Code and results that are merely correct are generally not enough, on their own, to achieve a high score.

**Scope:** this is a *mini*-project, with a short deadline. We are expecting work that is consistent with that timeframe, but that is serious, thoughtful, and rigorous. This assignment will almost certainly require more time and effort than the typical weekly homework. **For group work, the expected scope grows linearly with the number of participants.**

# 0. Project team

Estelle Hooper (ehh52), Gabriella Chu (gc386)

# 1. Question(s) (10 points)

### Research Question: How can we recommend books of other genres to lovers of a single genre?
#### Motivation 
When choosing any form of entertainment, such as a TV show, play, or movie, people tend to favor a particular genre or gravitate towards a favorite story. Unlike those forms, books are consumed at the reader's own pace, and staying committed to a written story can cost more time and mental effort. Because time and attention are scarce, people are more likely to read books of one genre. Many people do not have the capacity to branch out and try new genres, feeling that the safer choice is to read a more predictable book of their favored genre rather than risk not enjoying a book of a different genre.
#### Hypothesis
For a reader to have confidence in another genre, there must be overlapping qualities between the reader's favorite genre and the new genre. Fans of the horror genre may be fond of the setting because horror fiction often involves supernatural beings or powers, such as ghosts. Therefore, they may also enjoy science fiction or fantasy that similarly incorporates elements outside of the natural world. Besides content, genres may overlap in how their authors tend to portray characters. Adventure and detective stories tend to involve a partner-in-crime, with the two characters developing strong trust. If the character interactions are important, then perhaps fans of the detective genre would also be fans of the romance genre, since the latter involves the development of a relationship between characters.
#### Project
Our corpus consists of a good sample of horror novels, detective novels, and a miscellaneous group of novels from other genres. We want to see how close these "other" novels are to either the horror genre or the detective genre. We will train a classifier on the detective and horror novels and run it on the "other" novels. Each "other" novel will receive a horror or detective label (despite not being canonically considered part of the horror or detective genres). Using these classifier outcomes, we will recommend the "other" books labeled "detective" to fans of the detective genre and those labeled "horror" to fans of the horror genre. For example, if the classifier ran on *Harry Potter*, a fantasy series, and gave it a detective label, we would recommend *Harry Potter* to detective fans.
#### Expectations
Overall, we don't expect the recommendations we obtain from the classifier to be very strong, because it is trained only on detective and horror, and it will be hard to find words that overlap between those two genres and the various other genres in the corpus. We expect detective and horror novels with the most subplots and settings similar to other genres to score well with those genres. For example, if detective novels tend to have romance subplots, we expect the classifier to label romance novels as detective.

# 2. Methods (10 points)

### I. Data Preparation
From the class corpus, we will divide the novels into two subsets: the horror and detective novels, and the rest of the novels. The horror and detective novels will act as our "training" dataset because we will train the classifier to classify text as either **detective (==1) or horror (==0).** All the novels of other genres will act as our "testing" dataset, and the classifier will predict each of these novels as either horror or detective. We will use the detective column in the metadata to get the gold labels of the training data in order to score our model later on. We'll also open and read all the novels in this step.
### II. Vectorization
We will vectorize the novels with tf-idf weighting, L2 normalization, and z-scores to normalize and standardize our corpus, and consider pre-processing factors like stopwords and lowercasing. Then, we'll create multiple vectorizers with a variety of maximum feature counts and fit them to the training data (true horror and detective). In the next step, we will try different combinations of these matrices (each with a different number of input features) with different types of classifiers.
### III. Classifier
We will test two classification methods (Multinomial Naive Bayes and Logistic Regression) with different parameters, using X_train matrices of different sizes and the gold labels (y_train). We will compare the accuracy, precision, recall, and F1 scores and select the best classifier.
### IV. Feature Importance 
After finding the model with the best F1 score, we will examine the words/tokens/features with the heaviest weights that aid in classifying a work as detective or horror. If we see odd words, such as character names, or common stopwords as the top features, we will reconsider our vectorization steps.
### V. Testing/Predicting
After finalizing our classifier, we will run it on our testing data, or the novels of other genres without detective or horror labels. From this classifier, each "other" novel will receive a horror or detective label (y_test).
### VI. Scatterplots/Visualizations (to be discussed in Part 4: Results)
Because the classifier only makes binary decisions, we want to examine the most "detective-y" and "horror-y" novels, since there will be some novels for which the classifier could not make as definitive a decision. We will use SVD to graph all the novels in 2D space. We will then create multiple scatterplots to look at these decision boundaries. The most important scatterplots will be the ones that show how close the predicted novels are to the true-labeled novels in that space. We will consider the predicted dots that are closest to the true dots to be the best recommendations for that genre.
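
The plan above (project the novels into 2D with SVD, then rank predicted novels by their distance to true-labeled novels) can be sketched on random data. This is only an illustration under assumptions: the matrix sizes and the names `toy_train`, `toy_other`, and `closest_first` are made up, and fitting the projection on the labeled novels alone is one possible choice, not necessarily the one used later.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)
# Random non-negative matrices standing in for tf-idf features (sizes are arbitrary).
toy_train = rng.random((6, 20))   # true-labeled novels
toy_other = rng.random((4, 20))   # novels from other genres

# Fit the projection on the labeled novels and reuse it for the others.
svd = TruncatedSVD(n_components=2, random_state=0)
train_2d = svd.fit_transform(toy_train)
other_2d = svd.transform(toy_other)

# Distance from every "other" novel to every labeled novel in the 2D space.
dists = euclidean_distances(other_2d, train_2d)   # shape (4, 6)
nearest_labeled = dists.argmin(axis=1)            # closest labeled novel per "other" novel
closest_first = dists.min(axis=1).argsort()       # "other" novels, nearest candidates first
print(nearest_labeled, closest_first)
```

Sorting by the minimum distance gives one ranking of candidate recommendations; averaging the distance to all novels of a genre would be another reasonable choice.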

# 3. Code (20 points)

In [1]:
# Imports (all of them!)
%matplotlib inline
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
from   sklearn.decomposition import TruncatedSVD
from   sklearn.feature_extraction.text import TfidfVectorizer
from   sklearn.preprocessing import StandardScaler, MinMaxScaler, normalize
from   sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

## I. Data Preparation
In this section, we cleaned and manipulated the class metadata in order to subset the data into training and testing sets. As stated previously, our training data is all horror or detective, and our testing data is all other books that are neither horror nor detective. The following is a list of the variables we created in order to proceed with vectorization. Besides subsetting, the most notable tasks were sorting the dataframes by title in alphabetical order and obtaining the gold labels (y_train), which will be used with cross_validate to score our model when we build our classifier. 
- `metadata` df; metadata spreadsheet from class corpus
- `training_data` df; metadata with training novels, obtained by subsetting metadata with detective==True | horror==True (books with both labels are dropped)
- `testing_data` df; metadata with testing novels, obtained by subsetting metadata with detective==False & horror==False
- `training_names` list; list of filenames, obtained from training_data.filename.values
- `testing_names` list; list of filenames, obtained from testing_data.filename.values
- `training_books` list; list of strings/books, obtained by reading files using training_names
- `testing_books` list; list of strings/books, obtained by reading files using testing_names
- `y_train` list; the **gold labels (1=detective, 0=horror)**, or true values of the training data, obtained from the training_data detective column


In [2]:
# class corpus metadata
metadata = pd.read_csv("class_corpus_metadata.csv")
metadata.shape

(160, 34)

In [3]:
metadata.head()

Unnamed: 0.1,Unnamed: 0,check_1,check_2,title,year,author1_surname,author1_givenname,author2_surname,author2_givenname,gender_author1,...,feminist fiction,mystery,adventure,tragedy,children,regency,manners,philosophical,coming-of-age,filename
0,nsg57,scw222,lcc82,"Writings in the United Amateur, 1915 - 1922",1922,Lovecraft,Howard,,,Male,...,False,True,False,False,False,False,False,True,False,Lovecraft_WritingsintheUnitedAmateur1915-1922.txt
1,fhh26,gs542,tj256,Whose Body?,1923,Sayers,Dorothy L.,,,Female,...,False,True,False,False,False,False,False,False,False,Sayres_WhoseBody.txt
2,cl2264,,,Voodoo Planet,1959,Norton,Andre,,,Female,...,False,False,True,False,False,False,False,False,False,Norton_VoodooPlanet.txt
3,ehh52,sjr255,kg428,"Varney the Vampire; Or, the Feast of Blood by ...",1845,Rymer,James Malcolm,Prest,Thomas Peckett,Male,...,False,False,False,False,False,False,False,False,False,Prest_Rhymer_VarneyTheVampire.txt
4,dgr73,jlp367,kg428,Uncle Tom's Cabin,1852,Stowe,Harriet Beecher,,,Female,...,False,False,False,False,False,False,False,False,False,Stowe_UncleTom_sCabin.txt


In [4]:
# training data are books that are either horror or detective
training_data = metadata[(metadata['horror']==True) | (metadata['detective']==True)]

# drop books that are both horror and detective
drop = metadata[(metadata['horror']==True) & (metadata['detective']==True)]
training_data = training_data.drop(drop.index)

# testing data are books that are neither horror nor detective
testing_data = metadata[(metadata['horror']==False) & (metadata['detective']==False)]

# sort titles alphabetically 
training_data = training_data.sort_values('title')
testing_data = testing_data.sort_values('title')
# note: training+testing+dropped row = 159 rows, class corpus = 160 rows, "An Unkindness of Ghosts" has no input for horror and detective column
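
The row count in the note above follows from how pandas compares missing values: a row whose `horror` and `detective` cells are empty matches neither `== True` nor `== False`, so it falls out of both subsets. A minimal sketch on a made-up four-row frame (the titles and values are hypothetical, not taken from the class metadata):

```python
import numpy as np
import pandas as pd

# Toy metadata; the last row mimics a book with no genre input (hypothetical values).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "horror": [True, False, False, np.nan],
    "detective": [False, True, False, np.nan],
})

train_mask = (toy["horror"] == True) | (toy["detective"] == True)
test_mask = (toy["horror"] == False) & (toy["detective"] == False)

# NaN == True and NaN == False are both False, so row "D" lands in neither subset.
unassigned = toy[~train_mask & ~test_mask]
print(unassigned["title"].tolist())
```

Checking `~train_mask & ~test_mask` on the real metadata is a quick way to list any books that were silently excluded.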

### There are 80 combined horror and detective novels in the corpus that we will use to train the classifier.

In [5]:
training_data=training_data.reset_index(drop=True)
training_data

Unnamed: 0.1,Unnamed: 0,check_1,check_2,title,year,author1_surname,author1_givenname,author2_surname,author2_givenname,gender_author1,...,feminist fiction,mystery,adventure,tragedy,children,regency,manners,philosophical,coming-of-age,filename
0,tl566,hz542,ja532,813,1910,Leblanc,Maurice,,,Male,...,False,True,False,False,False,False,False,False,False,Leblanc_813.txt
1,gc386,,,A Strange Disappearance,1998,Green,Anna Katharine,,,Female,...,False,True,False,False,False,False,False,False,False,GreenAnnaKatharine_AStrangeDisappearance.txt
2,nca28,tl566,stw43,A Study in Scarlet,1887,Conan Doyle,Arthur,,,Male,...,False,True,False,False,False,False,False,False,False,ConanDoyle_AStudyInScarlet.txt
3,jc2739,,,Agatha Webb,1899,Green,Anna Katharine,,,Female,...,False,True,False,False,False,False,False,False,False,Green_AgathaWebb.txt
4,lcc82,yk499,,Carmilla,1872,Le_Fanu,Joseph Sheridan,,,Male,...,False,False,False,False,False,False,False,False,False,Carmilla.txt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,tr333,sjs457,sl2324,The Valley of Fear,1915,Doyle,Arthur Conan,,,Male,...,False,False,False,False,False,False,False,False,False,Doyle_TheValleyOfFear.txt
76,lrs263,sh785,hz542,The Wisdom of Father Brown,1914,Chesterton,Gilbert Keith,,,Male,...,False,False,False,False,False,False,False,False,False,Chesterton_TheWisdomOfFatherBrown.txt
77,ehh52,sjr255,kg428,"Varney the Vampire; Or, the Feast of Blood by ...",1845,Rymer,James Malcolm,Prest,Thomas Peckett,Male,...,False,False,False,False,False,False,False,False,False,Prest_Rhymer_VarneyTheVampire.txt
78,fhh26,gs542,tj256,Whose Body?,1923,Sayers,Dorothy L.,,,Female,...,False,True,False,False,False,False,False,False,False,Sayres_WhoseBody.txt


### There are 78 combined novels from a variety of genres that are not horror or detective in the corpus.

In [7]:
testing_data=testing_data.reset_index(drop=True)
testing_data

Unnamed: 0.1,Unnamed: 0,check_1,check_2,title,year,author1_surname,author1_givenname,author2_surname,author2_givenname,gender_author1,...,feminist fiction,mystery,adventure,tragedy,children,regency,manners,philosophical,coming-of-age,filename
0,tr333,sjs457,sl2324,A Round Dozen,1883,Coolidge,Susan,,,Female,...,False,False,False,False,False,False,False,False,False,Coolidge_ARoundDozen.txt
1,kwy3,cl922,hk627,A Sicillian Romance,1790,Radcliffe,Ann Ward,,,Female,...,False,False,False,False,False,False,False,False,False,radcliffeann_a_sicillian_romance.txt
2,lqz4,gt294,lcc82,Adele Doring at Boarding-School,1921,North,Grace May,,,Female,...,False,False,False,False,True,False,False,False,False,adele_doring_boarding_school.txt
3,yc2669,xf89,wms87,Agnes Grey,1847,Bronte,Anne,,,Female,...,True,False,False,False,False,False,True,False,False,Bronte_AgnesGrey.txt
4,mn454,ar2465,jlp367,An Old-Fashioned Girl,1869,Alcott,Louisa May,,,Female,...,False,False,False,False,True,False,True,False,True,Alcott_AnOld-FashionedGirl.txt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73,jc2739,,,This Side of Paradise,1920,Fitzgerald,F. Scott,,,Male,...,False,False,False,False,False,False,False,False,True,Fitzgerald_ThisSideOfParadise.txt
74,vs339,thh55,,To Kill A Mockingbird,1960,Lee,Harper,,,Female,...,False,False,False,False,False,False,False,False,False,Lee_ToKillAMockingbird.txt
75,fhh26,gs542,tj256,Twenty Thousand Leagues Under the Sea,1870,Verne,Jules,,,Male,...,False,False,True,False,False,False,False,False,False,Verne_TwentyThousandLeagues.txt
76,dgr73,jlp367,kg428,Uncle Tom's Cabin,1852,Stowe,Harriet Beecher,,,Female,...,False,False,False,False,False,False,False,False,False,Stowe_UncleTom_sCabin.txt


### Opening book files

In [8]:
# get book file names to open
training_names = training_data.filename.values
testing_names = testing_data.filename.values

In [9]:
print('First book in the training dataset:',training_names[0])
print('First book in the testing dataset:',testing_names[0])

First book in the training dataset: Leblanc_813.txt
First book in the testing dataset: Coolidge_ARoundDozen.txt


In [10]:
# 1=detective, 0=horror, gold labels
y_train=(training_data.detective.values*1).astype('int')
y_train

In [11]:
# open and append training books together
training_books=[]
for book in training_names:
    with open(book, 'r',encoding='utf-8') as f:
        file = f.read().replace("\n", " ") 
        training_books.append(file)
# open and append testing books together
testing_books=[]
for book in testing_names:
    with open(book, 'r',encoding='utf-8') as f:
        file = f.read().replace("\n", " ") 
        testing_books.append(file)

## II. Vectorization, stopwords, normalization, standardization, matrix fitting

### Stopwords
We experimented with stopwords and with the max_df parameter of TfidfVectorizer(). We first passed punctuation as stopwords and compared the matrix with and without that parameter; the resulting shape was the same. Given that result, we decided to leave the stop_words parameter unset and let the idf weighting take care of the normalization, in addition to L2. We picked L2 normalization over L1 because....

In [13]:
# punctuation characters plus a few extra marks found in the texts
punct = list(string.punctuation) + ['--', '`', "“", "”"]

### Vectorization (+ normalization)

In [14]:
# Custom preprocessing to remove escaped characters in input, taken from MP02
def pre_proc(x):
    '''
    Takes a unicode string.
    Lowercases, applies NFKD unicode normalization, and removes some escapes.
    Returns a standardized version of the string.
    '''
    import unicodedata
    # raw strings so that the backslashes in the escapes are matched literally
    cleaned = x.replace(r"\'", "'").replace(r"\ in\ form", " inform")
    return unicodedata.normalize('NFKD', cleaned.lower().strip())

# Set up vectorizer

vectorizer = TfidfVectorizer(
    encoding='utf-8',
    preprocessor=pre_proc,
   # stop_words=punct,
    min_df=2, # Note this
    max_df=0.8, # This, too
    binary=False,
    norm='l2',
    use_idf=True, # And this,
    #max_features=10000
)

# Fit the vectorizer on the training books and build the document-term matrix
X_train = vectorizer.fit_transform(training_books)
print("Matrix shape:", X_train.shape)

Matrix shape: (80, 30376)


### Standardization: Z-scores

In [15]:
#standardization
X_train_Z = StandardScaler(with_mean=False).fit_transform(X_train)
display(X_train_Z)
print('z-scored l2 mean:', round(np.mean(X_train_Z),3))
#np.std(X_train)

<80x30376 sparse matrix of type '<class 'numpy.float64'>'
	with 376031 stored elements in Compressed Sparse Row format>

z-scored l2 mean: 0.292
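
The nonzero mean printed above is expected. With `with_mean=False`, `StandardScaler` divides each column by its standard deviation but does not subtract the column mean, which keeps the matrix sparse. A minimal sketch on a small made-up matrix (the values are arbitrary):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up 4x3 sparse matrix standing in for a tf-idf matrix.
X = sparse.csr_matrix(np.array([
    [0.0, 0.2, 0.9],
    [0.5, 0.0, 0.1],
    [0.0, 0.4, 0.0],
    [0.3, 0.0, 0.7],
]))

X_scaled = StandardScaler(with_mean=False).fit_transform(X)
dense = X_scaled.toarray()

col_std = dense.std(axis=0)    # each column now has unit standard deviation
col_mean = dense.mean(axis=0)  # but the means are not shifted to zero
print(col_std.round(3), col_mean.round(3))
```

Because the values stay non-negative, the scaled matrix can still be passed to `MultinomialNB`, which would reject centered (negative) features.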


### Matrix fitting

In [17]:
vect_n=[]
matrix_n= {}
# NOTE: to keep all features, max_features must exceed the vocabulary size, so 35000 is actually just all 30376 features
feat_n = [5000,10000,15000,17500,20000,22500,25000,30000,35000]
for x in feat_n:
    vectorizer = TfidfVectorizer(
        encoding='utf-8',
        preprocessor=pre_proc,
        min_df=2, # Note this
        max_df=0.8, # This, too
        binary=False,
        norm='l2',
        use_idf=True, # And this
        max_features=x)
    vect_n.append(vectorizer)
    matrix = vectorizer.fit_transform(training_books)
    X_train_Z = StandardScaler(with_mean=False).fit_transform(matrix)
    dict_key=str(x)
    matrix_n[dict_key] = X_train_Z

In [18]:
matrix_n

{'5000': <80x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 187794 stored elements in Compressed Sparse Row format>,
 '10000': <80x10000 sparse matrix of type '<class 'numpy.float64'>'
 	with 274957 stored elements in Compressed Sparse Row format>,
 '15000': <80x15000 sparse matrix of type '<class 'numpy.float64'>'
 	with 320917 stored elements in Compressed Sparse Row format>,
 '17500': <80x17500 sparse matrix of type '<class 'numpy.float64'>'
 	with 335934 stored elements in Compressed Sparse Row format>,
 '20000': <80x20000 sparse matrix of type '<class 'numpy.float64'>'
 	with 347813 stored elements in Compressed Sparse Row format>,
 '22500': <80x22500 sparse matrix of type '<class 'numpy.float64'>'
 	with 357000 stored elements in Compressed Sparse Row format>,
 '25000': <80x25000 sparse matrix of type '<class 'numpy.float64'>'
 	with 364318 stored elements in Compressed Sparse Row format>,
 '30000': <80x30000 sparse matrix of type '<class 'numpy.float64'>'
 	with 375

## III. Classifier

In [16]:
# Examine the performance of our simple classifiers
# Freebie function to summarize and display classifier scores
# source: mp2
def compare_scores(scores_dict):
    '''
    Takes a dictionary of cross_validate scores.
    Returns a color-coded Pandas dataframe that summarizes those scores.
    '''
    df = pd.DataFrame(scores_dict).T.applymap(np.mean).style.background_gradient(cmap='RdYlGn')
    return df

### Multinomial Naive Bayes

In [19]:
nb_classifiers = {
    'M NB Default, Alpha=1':MultinomialNB(alpha = 1),
    'M NB fit_prior=False':MultinomialNB(fit_prior = False),
}
scores = {} # Store cross-validation results in a dictionary
for classifier in nb_classifiers: 
    scores[classifier] = cross_validate( # perform cross-validation
        nb_classifiers[classifier], # classifier object
        X_train_Z, # feature matrix
        y_train, # gold labels
        cv=10, #number of folds
        scoring=['accuracy','precision', 'recall', 'f1', 'f1_macro', 'f1_micro'] # scoring methods
    )
       
compare_scores(scores)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_f1_macro,test_f1_micro
"M NB Default, Alpha=1",0.013639,0.012266,0.8875,0.868095,0.98,0.917879,0.867273,0.8875
M NB fit_prior=False,0.011189,0.009851,0.8875,0.868095,0.98,0.917879,0.867273,0.8875


In [20]:
scores = {} # Store cross-validation results in a dictionary
for matrix in matrix_n: 
    print(matrix_n[matrix].shape)
    scores[matrix] = cross_validate( # perform cross-validation
        MultinomialNB(alpha = 1), # classifier object
        matrix_n[matrix], # feature matrix
        y_train, # gold labels
        cv=10, #number of folds
        scoring=['accuracy','precision', 'recall', 'f1', 'f1_macro', 'f1_micro'] # scoring methods
    )

(80, 5000)
(80, 10000)
(80, 15000)
(80, 17500)
(80, 20000)
(80, 22500)
(80, 25000)
(80, 30000)
(80, 30376)


In [21]:
compare_scores(scores)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_f1_macro,test_f1_micro
5000,0.006741,0.009335,0.8875,0.868095,0.98,0.917879,0.867273,0.8875
10000,0.008804,0.010193,0.8875,0.868095,0.98,0.917879,0.867273,0.8875
15000,0.009937,0.009778,0.875,0.847619,1.0,0.913636,0.841818,0.875
17500,0.010999,0.010392,0.9,0.87619,1.0,0.930303,0.875152,0.9
20000,0.011596,0.00995,0.8875,0.864286,1.0,0.922727,0.856364,0.8875
22500,0.011248,0.010174,0.8625,0.844286,0.98,0.902727,0.829697,0.8625
25000,0.01268,0.010711,0.875,0.85619,0.98,0.910303,0.848485,0.875
30000,0.01534,0.012773,0.8875,0.868095,0.98,0.917879,0.867273,0.8875
35000,0.011224,0.01045,0.8875,0.868095,0.98,0.917879,0.867273,0.8875


In [22]:
vect_n2=[]
matrix_n2= {}
feat_n2 = [15000,16000,17000,18000,19000,20000,17500]
for x in feat_n2:
    vectorizer = TfidfVectorizer(
        encoding='utf-8',
        preprocessor=pre_proc,
        min_df=2, # Note this
        max_df=0.8, # This, too
        binary=False,
        norm='l2',
        use_idf=True, # And this
        max_features=x)
    vect_n2.append(vectorizer)
    matrix = vectorizer.fit_transform(training_books)
    X_train_Z = StandardScaler(with_mean=False).fit_transform(matrix)
    dict_key=str(x)
    matrix_n2[dict_key] = X_train_Z

In [23]:
scores2 = {} # Store cross-validation results in a dictionary
for matrix2 in matrix_n2: 
    scores2[matrix2] = cross_validate( # perform cross-validation
        MultinomialNB(), # classifier object
        matrix_n2[matrix2], # feature matrix
        y_train, # gold labels
        cv=10, #number of folds
        scoring=['accuracy','precision', 'recall', 'f1', 'f1_macro', 'f1_micro'] # scoring methods
    )

In [24]:
compare_scores(scores2)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_f1_macro,test_f1_micro
15000,0.00994,0.00958,0.875,0.847619,1.0,0.913636,0.841818,0.875
16000,0.009666,0.009706,0.875,0.864286,0.98,0.911616,0.843665,0.875
17000,0.010716,0.009188,0.875,0.864286,0.98,0.911616,0.843665,0.875
18000,0.010003,0.009495,0.8875,0.87619,0.98,0.919192,0.862453,0.8875
19000,0.010505,0.009096,0.9,0.87619,1.0,0.930303,0.875152,0.9
20000,0.010608,0.009014,0.8875,0.864286,1.0,0.922727,0.856364,0.8875
17500,0.009432,0.009396,0.9,0.87619,1.0,0.930303,0.875152,0.9


### Logistic Regression

In [25]:
#fit log reg classifier
log_classifiers = {
    'log1':LogisticRegression(),
    'log2':LogisticRegression(max_iter = 1000),
    'log3':LogisticRegression(max_iter = 5000),
    'log4':LogisticRegression(max_iter = 10000),
    'log5':LogisticRegression(max_iter = 50000)
}

scores = {} # Store cross-validation results in a dictionary
for classifier in log_classifiers: 
    scores[classifier] = cross_validate( # perform cross-validation
        log_classifiers[classifier], # classifier object
        X_train_Z, # feature matrix
        y_train, # gold labels
        cv=10, #number of folds
        scoring=['accuracy','precision', 'recall', 'f1', 'f1_macro', 'f1_micro'] # scoring methods
    )
       
compare_scores(scores)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_f1_macro,test_f1_micro
log1,0.447972,0.011594,0.8,0.78631,0.983333,0.866317,0.728159,0.8
log2,0.446253,0.011721,0.8,0.78631,0.983333,0.866317,0.728159,0.8
log3,0.450153,0.010538,0.8,0.78631,0.983333,0.866317,0.728159,0.8
log4,0.421565,0.011627,0.8,0.78631,0.983333,0.866317,0.728159,0.8
log5,0.447583,0.011263,0.8,0.78631,0.983333,0.866317,0.728159,0.8


In [26]:
scores = {} # Store cross-validation results in a dictionary
for matrix in matrix_n: 
    print(matrix_n[matrix].shape)
    scores[matrix] = cross_validate( # perform cross-validation
        LogisticRegression(), # classifier object
        matrix_n[matrix], # feature matrix
        y_train, # gold labels
        cv=10, #number of folds
        scoring=['accuracy','precision', 'recall', 'f1', 'f1_macro', 'f1_micro'] # scoring methods
    )

(80, 5000)
(80, 10000)
(80, 15000)
(80, 17500)
(80, 20000)
(80, 22500)
(80, 25000)
(80, 30000)
(80, 30376)


In [27]:
compare_scores(scores)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1,test_f1_macro,test_f1_micro
5000,0.181777,0.009573,0.8625,0.844286,0.98,0.902727,0.829697,0.8625
10000,0.332523,0.010715,0.8125,0.80381,0.963333,0.869394,0.76303,0.8125
15000,0.408429,0.010847,0.8125,0.798214,0.983333,0.873893,0.746946,0.8125
17500,0.436322,0.010205,0.8,0.78631,0.983333,0.866317,0.728159,0.8
20000,0.474196,0.010005,0.7875,0.777381,0.983333,0.859907,0.699953,0.7875
22500,0.541965,0.011539,0.8,0.777381,1.0,0.868998,0.714499,0.8
25000,0.606185,0.010137,0.7875,0.765476,1.0,0.861422,0.695711,0.7875
30000,0.762614,0.010939,0.775,0.756548,1.0,0.855012,0.667506,0.775
35000,0.768601,0.011013,0.775,0.756548,1.0,0.855012,0.667506,0.775
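
In the comparisons above, the vectorizer and scaler are fit on all 80 training books before cross-validation, so each held-out fold has already influenced the vocabulary and the scaling. One way to keep those steps inside every fold is a scikit-learn pipeline. The sketch below uses a made-up toy corpus (the sentences, labels, and names such as `toy_docs` and `pipe` are assumptions for illustration) and is not a rerun of the experiments above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up corpus: 1 = detective, 0 = horror.
toy_docs = [
    "the inspector found a clue near the stolen card",
    "the detective arrested the suspicious visitor",
    "a clue led the inspector to the stolen jewels",
    "the detective consulted his notes on the case",
    "the inspector followed the suspicious clue",
    "the detective found the stolen letter",
    "a ghost wandered the endless storm in dreams",
    "the vampire trembled in the solemn night",
    "dreams of flesh and storm tortured the ghost",
    "the vampire wandered through endless dreams",
    "the ghost trembled in the solemn storm",
    "flesh and dreams tortured the vampire",
]
toy_labels = [1] * 6 + [0] * 6

pipe = make_pipeline(
    TfidfVectorizer(norm="l2"),
    StandardScaler(with_mean=False),
    MultinomialNB(),
)

# The vectorizer and scaler are refit on the training part of every fold.
toy_scores = cross_validate(pipe, toy_docs, toy_labels, cv=3, scoring=["accuracy", "f1"])
print(toy_scores["test_accuracy"].mean())
```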


## IV. Feature Importance

In [28]:
#source: https://stackoverflow.com/questions/50526898/how-to-get-feature-importance-in-naive-bayes
feat_impNB=MultinomialNB()
s=feat_impNB.fit(matrix_n2['17500'],y_train)

neg_class_prob_sorted = feat_impNB.feature_log_prob_[0, :].argsort()[::-1]
pos_class_prob_sorted = feat_impNB.feature_log_prob_[1, :].argsort()[::-1]


print("The most important words horror books:\n")
display(np.take(vectorizer.get_feature_names_out(), neg_class_prob_sorted[:50]))
print("\nThe most important words detective books:\n")
display(np.take(vectorizer.get_feature_names_out(), pos_class_prob_sorted[:50])) # class 1 = detective

The most important words horror books:



array(['content', 'reality', 'closer', 'flesh', 'yield', 'choice',
       'attend', 'occur', 'struggled', 'whence', 'music', 'tender',
       'storm', 'solemn', 'mass', 'distress', 'dreaming', 'kissed',
       'flow', 'equal', 'trembled', 'gladly', 'wandering', 'travel',
       'reasonable', 'strongly', 'endless', 'glory', 'motion',
       'exhausted', 'provided', 'breathing', 'suffer', 'changes',
       'contrast', 'sleeping', 'composed', 'confusion', 'encourage',
       'considering', 'strain', 'wave', 'dreams', 'justified', 'animal',
       'tortured', 'branches', 'dignity', 'turns', 'treated'],
      dtype=object)


The most important words detective books:



array(['marry', 'morrow', 'suspicions', 'famous', 'aroused', 'suspicious',
       'warn', 'valuable', 'stolen', 'smoking', 'clue', 'profession',
       'conceal', 'arrested', 'hunting', 'interests', 'advised',
       'reasonable', 'impatiently', 'finds', 'objection', 'tragedy',
       'prefer', 'liberty', 'exclamation', 'picking', 'begged', 'mud',
       'visitors', 'tells', 'bending', 'absurd', 'gets', 'card',
       'considering', 'propose', 'disappearance', 'lunch', 'crushed',
       'consulted', 'investigation', 'gesture', 'employ', 'informed',
       'appreciate', 'cigar', 'detective', 'blame', 'replaced',
       'reputation'], dtype=object)

## V. Testing/Predicting

In [29]:
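vectorizer = TfidfVectorizer(
    encoding='utf-8',
    preprocessor=pre_proc,
    min_df=2,          # drop words that appear in fewer than 2 books
    max_df=0.8,        # drop words that appear in more than 80% of books
    binary=False,
    norm='l2',
    use_idf=True,      # weight term frequencies by inverse document frequency
    max_features=17500)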

In [30]:
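# Caution: fit_transform learns a new vocabulary from the test books, so its
# columns may not line up with the columns of matrix_n['17500'] that the
# classifier is trained on.
X_test = vectorizer.fit_transform(testing_books)
# Despite its name, y_test holds predicted labels (0 = horror, 1 = detective).
y_test = MultinomialNB().fit(matrix_n['17500'], y_train).predict(X_test)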

In [31]:
y_test

array([1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
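
The prediction cell above calls `fit_transform` on the test books, which builds a separate vocabulary for them. A common way to keep training and test columns aligned is to fit the vectorizer once on the training texts and only call `transform` on the test texts. The sketch below shows that pattern on a hypothetical toy corpus; the `toy_*` names and sentences are invented for illustration and are not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpora standing in for training_books / testing_books
toy_train = [
    "the detective found a clue and the stolen card",
    "the inspector arrested the suspicious visitor",
    "a storm of dreams and trembling flesh in the night",
    "endless confusion and tortured dreams in the storm",
]
toy_y_train = [1, 1, 0, 0]   # 1 = detective, 0 = horror
toy_test = [
    "the detective arrested the visitor",
    "dreams of the endless storm",
]

toy_vectorizer = TfidfVectorizer(norm="l2", use_idf=True)
X_train = toy_vectorizer.fit_transform(toy_train)  # vocabulary is learned here only
X_test = toy_vectorizer.transform(toy_test)        # same vocabulary, same column order

model = MultinomialNB().fit(X_train, toy_y_train)
toy_pred = model.predict(X_test)
print("columns match:", X_train.shape[1] == X_test.shape[1])
print("predictions:", toy_pred)
```

Because both matrices come from the same fitted vectorizer, column j refers to the same word in each, which is what the classifier assumes when it scores the test books.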

### Test books predicted as detective

In [32]:
testing_data['y_pred']=y_test
print(len(testing_data[testing_data['y_pred']==1].title.values))
testing_data[testing_data['y_pred']==1].title.values

68


array(['A Round Dozen', 'A Sicillian Romance',
       'Adele Doring at Boarding-School', 'Agnes Grey',
       'An Old-Fashioned Girl', 'Anna Karenina', 'Don Quixote', 'Emma',
       'Flatland', 'Key Out of Time',
       "Little Men: Life at Plumfield With Jo's Boys", 'Little Women',
       'Mansfield Park', 'Mathilda', 'Micah Clarke', 'Middlemarch',
       'Mizora: A Prophecy', 'Moods', 'Mr. Standfast', 'Night and Day',
       "Nobody's Girl", 'Our Mutual Friend', 'Persuasion', 'Plague Ship',
       'Pride and Prejudice', 'Rainbow Valley', 'Sense and Sensibility',
       'Sense and Sensibility', 'Shirley', 'Silas Marner', 'Star Hunter',
       'Star of India', 'Storm over Warlock', 'Summer',
       'The Age of Innocence', 'The Beautiful and Damned', 'The Bell Jar',
       'The Best Made Plans', 'The Betrothed', 'The Colors of Space',
       'The Defiant Agents', 'The Disturbing Charm',
       'The Enchanted April', 'The Fall Of The Grand Sarrasin',
       'The Four Corners', 'The Garde

### Test books predicted as horror

In [33]:
print(len(testing_data[testing_data['y_pred']==0].title.values))
testing_data[testing_data['y_pred']==0].title.values

10


array(['Anne of Green Gables', 'Black Amazon of Mars',
       'Chaplet Of Pearls', 'House of Mirth', 'The Deluge',
       'The Door Through Space',
       'The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde',
       'The Lances of Lynwood', 'The Luckiest Girl in the School',
       'The Mill On The Floss'], dtype=object)

In [34]:
print("fraction of true detective novels in the training dataset:", len(training_data[training_data['detective']==True])/(len(training_books)))
print("fraction of novels of testing dataset predicted detective:", len(testing_data[testing_data['y_pred']==1].title.values)/len(testing_books))

fraction of true detective novels in the training dataset: 0.6375
fraction of novels of testing dataset predicted detective: 0.8717948717948718
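
The second fraction can be checked directly from the two counts printed earlier (68 books predicted detective, 10 predicted horror). The short sketch below does that arithmetic and compares the result with the training fraction; `predicted_share` is a hypothetical helper, and the prediction list is rebuilt from the counts rather than taken from `y_test`.

```python
def predicted_share(predictions, label):
    """Fraction of predictions equal to the given label."""
    return sum(1 for p in predictions if p == label) / len(predictions)

# Rebuilt from the counts in the outputs above: 68 detective, 10 horror
toy_predictions = [1] * 68 + [0] * 10
test_share = predicted_share(toy_predictions, 1)
train_share = 0.6375   # training fraction printed above

print(f"predicted detective share: {test_share:.4f}")
print(f"difference from training share: {test_share - train_share:+.4f}")
```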


## VI. Visualizations

In [35]:
all_books = training_books + testing_books
vis = vectorizer.fit_transform(all_books)
# Training labels are true labels; y_test holds the model's predictions.
labels = np.hstack((y_train, y_test))

In [36]:
training_data['shape']='train'
testing_data['shape']='test'
books_order=pd.concat([training_data, testing_data])
books_order['y']=labels
books_order['gender_author1']=books_order['gender_author1'].str.lower()
books_order

| Unnamed: 0.1 | Unnamed: 0 | check_1 | check_2 | title | year | author1_surname | author1_givenname | author2_surname | author2_givenname | gender_author1 | ... | tragedy | children | regency | manners | philosophical | coming-of-age | filename | shape | y_pred | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tl566 | hz542 | ja532 | 813 | 1910 | Leblanc | Maurice | | | male | ... | False | False | False | False | False | False | Leblanc_813.txt | train | | 1 |
| 1 | gc386 | | | A Strange Disappearance | 1998 | Green | Anna Katharine | | | female | ... | False | False | False | False | False | False | GreenAnnaKatharine_AStrangeDisappearance.txt | train | | 1 |
| 2 | nca28 | tl566 | stw43 | A Study in Scarlet | 1887 | Conan Doyle | Arthur | | | male | ... | False | False | False | False | False | False | ConanDoyle_AStudyInScarlet.txt | train | | 1 |
| 3 | jc2739 | | | Agatha Webb | 1899 | Green | Anna Katharine | | | female | ... | False | False | False | False | False | False | Green_AgathaWebb.txt | train | | 1 |
| 4 | lcc82 | yk499 | | Carmilla | 1872 | Le_Fanu | Joseph Sheridan | | | male | ... | False | False | False | False | False | False | Carmilla.txt | train | | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73 | jc2739 | | | This Side of Paradise | 1920 | Fitzgerald | F. Scott | | | male | ... | False | False | False | False | False | True | Fitzgerald_ThisSideOfParadise.txt | test | 1.0 | 1 |
| 74 | vs339 | thh55 | | To Kill A Mockingbird | 1960 | Lee | Harper | | | female | ... | False | False | False | False | False | False | Lee_ToKillAMockingbird.txt | test | 1.0 | 1 |
| 75 | fhh26 | gs542 | tj256 | Twenty Thousand Leagues Under the Sea | 1870 | Verne | Jules | | | male | ... | False | False | False | False | False | False | Verne_TwentyThousandLeagues.txt | test | 1.0 | 1 |
| 76 | dgr73 | jlp367 | kg428 | Uncle Tom's Cabin | 1852 | Stowe | Harriet Beecher | | | female | ... | False | False | False | False | False | False | Stowe_UncleTom_sCabin.txt | test | 1.0 | 1 |


In [39]:
# source: mp2
coords_books = TruncatedSVD(n_components=2).fit_transform(vis)
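
The cell above reduces the TF-IDF matrix to two SVD components, but the scatterplot itself is not shown here. One way such a plot could be drawn is sketched below on a hypothetical toy corpus: colour encodes the label and the marker encodes the train/test split, mirroring the `y` and `shape` columns. All `toy_*` names, the colour and marker choices, and the fixed `random_state` are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy inputs standing in for all_books, labels and the 'shape' column
toy_books = [
    "the detective found a clue and the stolen card",
    "the inspector arrested the suspicious visitor",
    "a storm of dreams and trembling flesh in the night",
    "endless confusion and tortured dreams in the storm",
    "the detective consulted the famous inspector",
    "wandering dreams and a solemn storm at night",
]
toy_labels = np.array([1, 1, 0, 0, 1, 0])   # 1 = detective, 0 = horror
toy_split = np.array(["train", "train", "train", "train", "test", "test"])

toy_matrix = TfidfVectorizer().fit_transform(toy_books)
toy_coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(toy_matrix)

fig, ax = plt.subplots()
markers = {"train": "o", "test": "x"}
colors = {0: "tab:red", 1: "tab:blue"}
for split, marker in markers.items():
    for label, color in colors.items():
        # Select the books in this split with this label
        mask = (toy_split == split) & (toy_labels == label)
        ax.scatter(toy_coords[mask, 0], toy_coords[mask, 1],
                   marker=marker, c=color, label=f"{split}, y={label}")
ax.set_xlabel("SVD component 1")
ax.set_ylabel("SVD component 2")
ax.legend()
```

For the test books the colour would reflect the predicted label rather than a true genre, so the plot shows where the classifier places them relative to the labelled training books.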

# 4. Results and discussion (40 points)

In [None]:
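# Plan for this section:
# - a few paragraphs in total
# - realistically, about 3 results
# - present them as a figure, a table, and an accuracy table
# - analyze each result, then end with one paragraph on how they fit together
# - groups are expected to write a bit more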

# 5. Reflection (10 points)

# 6. Resources consulted (0 points, but -5 if missing)

`pre_proc()`: MP 02, INFO 3350 <br>
`compare_scores()`: MP 02, INFO 3350<br>
Feature importance for sklearn Naive Bayes: https://stackoverflow.com/questions/50526898/how-to-get-feature-importance-in-naive-bayes<br>
SVD Scatterplot: MP 02, INFO 3350

# 7. Responsibility statement (0 points, but -5 if missing)
**See separate CMS assignment 'MP 03: Responsibility statement'.**