# Pandas Practice

## Imports

As always, we begin with some imports.

In [10]:
import glob
from pathlib import Path

import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

## Corpus 

For this notebook, we'll return to our corpus of _New York Times_ obituaries.

In [2]:
# collect filepaths as files
directory = "../docs/NYT-Obituaries/"
files = glob.glob(f"{directory}/*.txt")
files

['../docs/NYT-Obituaries/1945-Adolf-Hitler.txt',
 '../docs/NYT-Obituaries/1915-F-W-Taylor.txt',
 '../docs/NYT-Obituaries/1975-Chiang-Kai-shek.txt',
 '../docs/NYT-Obituaries/1984-Ethel-Merman.txt',
 '../docs/NYT-Obituaries/1953-Jim-Thorpe.txt',
 '../docs/NYT-Obituaries/1964-Nella-Larsen.txt',
 '../docs/NYT-Obituaries/1955-Margaret-Abbott.txt',
 '../docs/NYT-Obituaries/1984-Lillian-Hellman.txt',
 '../docs/NYT-Obituaries/1959-Cecil-De-Mille.txt',
 '../docs/NYT-Obituaries/1928-Mabel-Craty.txt',
 '../docs/NYT-Obituaries/1973-Eddie-Rickenbacker.txt',
 '../docs/NYT-Obituaries/1989-Ferdinand-Marcos.txt',
 '../docs/NYT-Obituaries/1991-Martha-Graham.txt',
 '../docs/NYT-Obituaries/1997-Deng-Xiaoping.txt',
 '../docs/NYT-Obituaries/1938-George-E-Hale.txt',
 '../docs/NYT-Obituaries/1885-Ulysses-Grant.txt',
 '../docs/NYT-Obituaries/1909-Sarah-Orne-Jewett.txt',
 '../docs/NYT-Obituaries/1957-Christian-Dior.txt',
 '../docs/NYT-Obituaries/1987-Clare-Boothe-Luce.txt',
 '../docs/NYT-Obituaries/1976-Jacques

In [3]:
# and collect obit titles, which are also the final section of the filepaths
obit_titles = [Path(file).stem for file in files]
obit_titles

['1945-Adolf-Hitler',
 '1915-F-W-Taylor',
 '1975-Chiang-Kai-shek',
 '1984-Ethel-Merman',
 '1953-Jim-Thorpe',
 '1964-Nella-Larsen',
 '1955-Margaret-Abbott',
 '1984-Lillian-Hellman',
 '1959-Cecil-De-Mille',
 '1928-Mabel-Craty',
 '1973-Eddie-Rickenbacker',
 '1989-Ferdinand-Marcos',
 '1991-Martha-Graham',
 '1997-Deng-Xiaoping',
 '1938-George-E-Hale',
 '1885-Ulysses-Grant',
 '1909-Sarah-Orne-Jewett',
 '1957-Christian-Dior',
 '1987-Clare-Boothe-Luce',
 '1976-Jacques-Monod',
 '1954-Getulio-Vargas',
 '1979-Stan-Kenton',
 '1990-Leonard-Bernstein',
 '1972-Jackie-Robinson',
 '1998-Fred-W-Friendly',
 '1991-Leo-Durocher',
 '1915-B-T-Washington',
 '1997-James-Stewart',
 '1981-Joe-Louis',
 '1983-Muddy-Waters',
 '1942-George-M-Cohan',
 '1989-Samuel-Beckett',
 '1962-Marilyn-Monroe',
 '2000-Charles-M-Schulz',
 '1967-Gregory-Pincus',
 '1894-R-L-Stevenson',
 '1978-Bruce-Catton',
 '1982-Arthur-Rubinstein',
 '1875-Andrew-Johnson',
 '1974-Charles-Lindbergh',
 '1964-Rachel-Carson',
 '1953-Marjorie-Rawlings',


## Create document-term matrix

### Initiate CountVectorizer as vectorizer

Remember document-term matrices, aka doc-term matrices, aka dtms? We learned about them in notebooks 10 and 11. We build one with scikit-learn's `CountVectorizer`, which we imported at the start of the lesson.

When we initialize our vectorizer, we tell it to read files as utf-8 and pass in our stopwords. We can also set `min_df`, the minimum number of documents a word must appear in to be included in the dtm. In this case, I've set it at 20.

In [4]:
# load stopwords
from sklearn.feature_extraction import text
text.ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [5]:
# read the extra stopword list; the with-block closes the file for us
with open('../docs/jockers_stopwords.txt') as text_file:
    jockers_words = text_file.read().split()
jockers_words

['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 'a',
 'aaron',
 'abbey',
 'abbie',
 'abdul',
 'abe',
 'across',
 'abel',
 'abigail',
 'about',
 'above',
 'abraham',
 'abram',
 'abst',
 'accordance',
 'according',
 'act',
 'actually',
 'ada',
 'adah',
 'adalberto',
 'adaline',
 'adam',
 'adan',
 'added',
 'among',
 'addie',
 'adela',
 'adelaida',
 'adelaide',
 'adele',
 'adelia',
 'adelina',
 'adeline',
 'adell',
 'adella',
 'adelle',
 'adena',
 'adina',
 'adj',
 'adolfo',
 'adolph',
 'adopted',
 'adria',
 'adrian',
 'adriana',
 'adriane',
 'adrianna',
 'adrien',
 'adrienne',
 'after',
 'afterwards',
 'afton',
 'again',
 'against',
 'agatha',
 'agnes',
 'agnus',
 'agueda',
 'agustina',
 'ahmad',
 'ahmed',
 'ai',
 'aida',
 'besides',
 'aide',
 'aiko',
 'aileen',
 'ailene',
 'aimee',
 'aja',
 'akilah',
 'al',
 'alaina',
 'alaine',
 'alan',
 'alana',
 'alane',
 'alanna',
 'alayna',
 'alba',
 'albert',
 'alberta',
 'albertha',
 'albertina',
 'albertine',
 'alberto',
 'albina',


In [6]:
new_stopwords = text.ENGLISH_STOP_WORDS.union(jockers_words)
new_stopwords

frozenset({'basil',
           'tempie',
           'madalene',
           'cecille',
           'ranae',
           'griselda',
           'jadwiga',
           'brandee',
           'clement',
           'melba',
           'pp',
           'charmain',
           'toward',
           'katherina',
           'marcella',
           'yon',
           'enough',
           'phil',
           'valarie',
           'sherita',
           'heather',
           'annice',
           'adaline',
           'darlene',
           'stephane',
           'russell',
           'recently',
           'christena',
           'irwin',
           'melodi',
           'ana',
           'andra',
           'kareen',
           'esteban',
           'go',
           'kathrin',
           'vincenzo',
           'thersa',
           'dulce',
           'margrett',
           'brant',
           'mardell',
           'laurence',
           'ozell',
           'shaquita',
           'luisa',
           'thereby'
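
The cell above works because `frozenset.union` accepts any iterable, including a plain list, and returns a new frozenset with duplicates collapsed. A tiny sketch with made-up word lists (stand-ins for the two real stopword sources):

```python
# stand-ins for text.ENGLISH_STOP_WORDS and jockers_words
builtin_words = frozenset({"a", "about", "above"})
extra_words = ["a", "aaron", "abbey"]  # a list, with one word that overlaps

# union returns a new frozenset; builtin_words itself is left unchanged
combined = builtin_words.union(extra_words)

print(sorted(combined))
print(len(combined))
```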

In [7]:
corpus_path = '../docs/NYT-Obituaries/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')

### Make list of filepaths

Because we set `input='filename'`, CountVectorizer reads each file in a list of filepaths and builds a dtm from their contents.

In [8]:
corpus = []
for title in obit_titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)

### Get feature names and set as column titles

The columns store word counts. We want to name the columns with the words stored in each, and to transform the dtm into a pandas dataframe, as follows:

In [12]:
vocab = vectorizer.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)
print('df shape is: ' + str(df.shape))

df shape is: (378, 2985)


Our dataframe has 378 rows, one for each document, or obituary, and 2985 columns, one for each word that's not in stopwords and appears in at least 20 documents.

## Import metadata

In [13]:
meta = pd.read_csv("../docs/NYT-Obituaries.csv", encoding = 'utf-8')
meta

Unnamed: 0,title,gender,obit,date
0,1945-Adolf-Hitler,0,Hitler Fought Way to Power Unique in Modern Hi...,1945.0
1,1915-F-W-Taylor,0,"F. W. Taylor, Expert in Efficiency, Dies BY TH...",1915.0
2,1975-Chiang-Kai-shek,0,The Life of Chiang Kai-shek: A Leader Who Was ...,1975.0
3,1984-Ethel-Merman,1,"Ethel Merman, Queen of Musicals, Dies at 76 By...",1984.0
4,1953-Jim-Thorpe,0,Jim Thorpe Is Dead On West Coast at 64 Special...,1953.0
...,...,...,...,...
373,1987-Andres-Segovie,0,Andres Segovie Is Dead at 94; His Crusade Elev...,1987.0
374,1987-Rita-Hayworth,1,"Rita Hayworth, Movie Legend, Dies By ALBIN KRE...",1987.0
375,1993-William-Golding,0,"June 20, 1993 William Golding Is Dead at 81; ...",1993.0
376,1932-Florenz-Ziegfeld,1,Florenz Ziegfeld Dies in Hollywood After Long ...,1932.0


In [14]:
meta = meta.rename(columns={'title': 'obit_title'})
meta

Unnamed: 0,obit_title,gender,obit,date
0,1945-Adolf-Hitler,0,Hitler Fought Way to Power Unique in Modern Hi...,1945.0
1,1915-F-W-Taylor,0,"F. W. Taylor, Expert in Efficiency, Dies BY TH...",1915.0
2,1975-Chiang-Kai-shek,0,The Life of Chiang Kai-shek: A Leader Who Was ...,1975.0
3,1984-Ethel-Merman,1,"Ethel Merman, Queen of Musicals, Dies at 76 By...",1984.0
4,1953-Jim-Thorpe,0,Jim Thorpe Is Dead On West Coast at 64 Special...,1953.0
...,...,...,...,...
373,1987-Andres-Segovie,0,Andres Segovie Is Dead at 94; His Crusade Elev...,1987.0
374,1987-Rita-Hayworth,1,"Rita Hayworth, Movie Legend, Dies By ALBIN KRE...",1987.0
375,1993-William-Golding,0,"June 20, 1993 William Golding Is Dead at 81; ...",1993.0
376,1932-Florenz-Ziegfeld,1,Florenz Ziegfeld Dies in Hollywood After Long ...,1932.0


In [15]:
meta = meta[["obit_title", "gender", "date"]]
meta

Unnamed: 0,obit_title,gender,date
0,1945-Adolf-Hitler,0,1945.0
1,1915-F-W-Taylor,0,1915.0
2,1975-Chiang-Kai-shek,0,1975.0
3,1984-Ethel-Merman,1,1984.0
4,1953-Jim-Thorpe,0,1953.0
...,...,...,...
373,1987-Andres-Segovie,0,1987.0
374,1987-Rita-Hayworth,1,1987.0
375,1993-William-Golding,0,1993.0
376,1932-Florenz-Ziegfeld,1,1932.0


Our metadata is stored as a pandas dataframe with a row for each obituary and three columns: obit_title, gender, and date (a year).

## Concatenate metadata and doc-term dataframe

In [16]:
df_concat = pd.concat([meta, df], axis = 1)
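
`pd.concat(..., axis=1)` lines rows up by index label, not by position, so this step quietly assumes that row *i* of `meta` and row *i* of `df` describe the same obituary. A minimal sketch with toy frames (the titles and values are invented) shows the behaviour and one possible sanity check:

```python
import pandas as pd

toy_meta = pd.DataFrame({"obit_title": ["doc-a", "doc-b", "doc-c"], "gender": [0, 1, 0]})
toy_counts = pd.DataFrame({"wrote": [3.0, 0.0, 6.0], "york": [1.0, 5.0, 1.0]})

# both frames have the default index 0, 1, 2, so rows pair up as intended
toy_concat = pd.concat([toy_meta, toy_counts], axis=1)

# sanity check: same number of rows before and after, and no NaN padding
same_length = len(toy_meta) == len(toy_counts) == len(toy_concat)
no_gaps = not toy_concat.isna().any().any()

print(toy_concat)
print(same_length, no_gaps)
```

If the two frames had different lengths or mismatched indexes, the unmatched rows would be padded with NaN, which this check would flag.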

In [18]:
df_concat

Unnamed: 0,obit_title,gender,date,000,10,100,11,12,13,14,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
0,1945-Adolf-Hitler,0,1945.0,21.0,1.0,0.0,2.0,3.0,4.0,3.0,...,3.0,0.0,11.0,19.0,0.0,0.0,1.0,1.0,0.0,9.0
1,1915-F-W-Taylor,0,1915.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1975-Chiang-Kai-shek,0,1975.0,3.0,3.0,1.0,1.0,0.0,0.0,0.0,...,6.0,0.0,3.0,14.0,0.0,0.0,1.0,2.0,1.0,1.0
3,1984-Ethel-Merman,1,1984.0,0.0,1.0,0.0,1.0,0.0,1.0,2.0,...,0.0,0.0,3.0,5.0,0.0,2.0,5.0,0.0,0.0,0.0
4,1953-Jim-Thorpe,0,1953.0,2.0,3.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,6.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373,1987-Andres-Segovie,0,1987.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,2.0,0.0,3.0,14.0,0.0,0.0,7.0,0.0,0.0,0.0
374,1987-Rita-Hayworth,1,1987.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0
375,1993-William-Golding,0,1993.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,1.0,0.0,0.0,0.0,1.0
376,1932-Florenz-Ziegfeld,1,1932.0,9.0,3.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,4.0,6.0,0.0,0.0,2.0,2.0,0.0,0.0


## Equalize numbers of men and women

We want our dataframe to have equal numbers of men and women. How many women are there? Women are counted as 1 and men as 0, so if we sum the gender column, we'll have the number of women:

In [20]:
# count the number of women
#df_concat[df_concat["gender"]==1].count()
df_concat["gender"].sum()

93
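
Summing works because the column holds only 0s and 1s. An alternative that shows both classes at once is `value_counts()`; a sketch on a toy column (the values are made up):

```python
import pandas as pd

toy_gender = pd.Series([0, 1, 0, 0, 1], name="gender")

n_women = int(toy_gender.sum())      # adds up the 1s
counts = toy_gender.value_counts()   # counts every distinct value

print(n_women)
print(counts)
```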

Then we separate men and women into two dataframes and take a random sample of 93 obituaries about men.

In [24]:
# create a dataframe with just the men
df_men = df_concat[df_concat["gender"]==0]
df_men

Unnamed: 0,obit_title,gender,date,000,10,100,11,12,13,14,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
0,1945-Adolf-Hitler,0,1945.0,21.0,1.0,0.0,2.0,3.0,4.0,3.0,...,3.0,0.0,11.0,19.0,0.0,0.0,1.0,1.0,0.0,9.0
1,1915-F-W-Taylor,0,1915.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1975-Chiang-Kai-shek,0,1975.0,3.0,3.0,1.0,1.0,0.0,0.0,0.0,...,6.0,0.0,3.0,14.0,0.0,0.0,1.0,2.0,1.0,1.0
4,1953-Jim-Thorpe,0,1953.0,2.0,3.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,6.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
8,1959-Cecil-De-Mille,0,1959.0,11.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,3.0,5.0,0.0,0.0,2.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371,1945-Jerome-Kern,0,1945.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0,...,3.0,0.0,0.0,3.0,0.0,1.0,4.0,0.0,0.0,0.0
372,1991-Frank-Capra,0,1991.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,2.0,4.0,0.0,1.0,1.0,1.0,0.0,0.0
373,1987-Andres-Segovie,0,1987.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,2.0,0.0,3.0,14.0,0.0,0.0,7.0,0.0,0.0,0.0
375,1993-William-Golding,0,1993.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,1.0,0.0,0.0,0.0,1.0


In [25]:
# create a dataframe with just the women
df_women = df_concat[df_concat["gender"]==1]
df_women

Unnamed: 0,obit_title,gender,date,000,10,100,11,12,13,14,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
3,1984-Ethel-Merman,1,1984.0,0.0,1.0,0.0,1.0,0.0,1.0,2.0,...,0.0,0.0,3.0,5.0,0.0,2.0,5.0,0.0,0.0,0.0
5,1964-Nella-Larsen,1,1964.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2.0,0.0,3.0,3.0,0.0,0.0,5.0,0.0,0.0,0.0
6,1955-Margaret-Abbott,1,1955.0,0.0,4.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
7,1984-Lillian-Hellman,1,1984.0,6.0,1.0,0.0,0.0,0.0,1.0,1.0,...,12.0,2.0,7.0,9.0,0.0,2.0,7.0,1.0,0.0,0.0
9,1928-Mabel-Craty,1,1928.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
353,1910-Florence-Nightingale,1,1854.0,3.0,0.0,0.0,0.0,1.0,0.0,1.0,...,2.0,0.0,2.0,7.0,0.0,1.0,1.0,0.0,0.0,1.0
355,1986-The-Challenger,1,1986.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,4.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
357,1998-Galina-Ulanova,1,1998.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,1.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
374,1987-Rita-Hayworth,1,1987.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0


In [28]:
# take a random sample of 93 men
df_men = df_men.sample(93)
df_men

Unnamed: 0,obit_title,gender,date,000,10,100,11,12,13,14,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
320,1993-Albert-Sabin,0,1993.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,6.0,0.0,3.0,3.0,0.0,0.0,0.0
48,1941-Frank-Conrad,0,1941.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0
265,1969-Coleman-Hawkins,0,1969.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,2.0,0.0,0.0,6.0,0.0,1.0,1.0,0.0,0.0,0.0
209,1933-Calvin-Coolidge,0,1933.0,6.0,1.0,0.0,1.0,0.0,0.0,0.0,...,8.0,0.0,7.0,14.0,0.0,0.0,8.0,0.0,0.0,0.0
224,1954-Henri-Matisse,0,1954.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,...,2.0,0.0,0.0,7.0,0.0,1.0,3.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184,1993-Arthur-Ashe,0,1993.0,0.0,2.0,0.0,0.0,1.0,1.0,1.0,...,0.0,0.0,7.0,5.0,0.0,1.0,5.0,0.0,0.0,1.0
129,1937-John-Rockefeller,0,1937.0,52.0,6.0,2.0,3.0,2.0,0.0,1.0,...,3.0,0.0,10.0,31.0,0.0,0.0,7.0,1.0,0.0,0.0
171,1931-Thomas-Edison,0,1931.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,2.0,0.0,1.0,0.0,0.0,0.0
95,1987-James-Baldwin,0,1987.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,5.0,0.0,4.0,4.0,0.0,0.0,1.0
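
Because `sample` draws at random, rerunning the cell gives a different set of men each time, and the model's results will shift with it. If reproducibility matters, `sample` accepts a `random_state` seed. A sketch on a toy frame (the titles are invented and the seed value 42 is arbitrary):

```python
import pandas as pd

# ten placeholder rows standing in for the men's obituaries
toy_men = pd.DataFrame({"obit_title": [f"doc-{i}" for i in range(10)], "gender": 0})

# the same seed produces the same sample on every run
first_draw = toy_men.sample(n=3, random_state=42)
second_draw = toy_men.sample(n=3, random_state=42)

print(first_draw.index.tolist() == second_draw.index.tolist())
```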


We then concatenate the sampled men dataframe with the women dataframe and reset the index.

In [29]:
# concatenate sampled men df with women df
df_final = pd.concat([df_men, df_women])
df_final

Unnamed: 0,obit_title,gender,date,000,10,100,11,12,13,14,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
320,1993-Albert-Sabin,0,1993.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,6.0,0.0,3.0,3.0,0.0,0.0,0.0
48,1941-Frank-Conrad,0,1941.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0
265,1969-Coleman-Hawkins,0,1969.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,2.0,0.0,0.0,6.0,0.0,1.0,1.0,0.0,0.0,0.0
209,1933-Calvin-Coolidge,0,1933.0,6.0,1.0,0.0,1.0,0.0,0.0,0.0,...,8.0,0.0,7.0,14.0,0.0,0.0,8.0,0.0,0.0,0.0
224,1954-Henri-Matisse,0,1954.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,...,2.0,0.0,0.0,7.0,0.0,1.0,3.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
353,1910-Florence-Nightingale,1,1854.0,3.0,0.0,0.0,0.0,1.0,0.0,1.0,...,2.0,0.0,2.0,7.0,0.0,1.0,1.0,0.0,0.0,1.0
355,1986-The-Challenger,1,1986.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,4.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
357,1998-Galina-Ulanova,1,1998.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,1.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
374,1987-Rita-Hayworth,1,1987.0,0.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0


In [31]:
# reset the index
df_final = df_final.reset_index()
df_final

Unnamed: 0,level_0,index,obit_title,gender,date,000,10,100,11,12,...,wrote,yale,year,years,yellow,yesterday,york,younger,youngest,youth
0,0,320,1993-Albert-Sabin,0,1993.0,2.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,6.0,0.0,3.0,3.0,0.0,0.0,0.0
1,1,48,1941-Frank-Conrad,0,1941.0,0.0,0.0,0.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0
2,2,265,1969-Coleman-Hawkins,0,1969.0,0.0,1.0,0.0,1.0,0.0,...,2.0,0.0,0.0,6.0,0.0,1.0,1.0,0.0,0.0,0.0
3,3,209,1933-Calvin-Coolidge,0,1933.0,6.0,1.0,0.0,1.0,0.0,...,8.0,0.0,7.0,14.0,0.0,0.0,8.0,0.0,0.0,0.0
4,4,224,1954-Henri-Matisse,0,1954.0,2.0,0.0,1.0,0.0,0.0,...,2.0,0.0,0.0,7.0,0.0,1.0,3.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
181,181,353,1910-Florence-Nightingale,1,1854.0,3.0,0.0,0.0,0.0,1.0,...,2.0,0.0,2.0,7.0,0.0,1.0,1.0,0.0,0.0,1.0
182,182,355,1986-The-Challenger,1,1986.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,4.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
183,183,357,1998-Galina-Ulanova,1,1998.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,1.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0
184,184,374,1987-Rita-Hayworth,1,1987.0,0.0,2.0,0.0,0.0,1.0,...,0.0,0.0,1.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0


We now have 186 rows: 93 men, 93 women.
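
Note that `reset_index()` keeps the old row labels in a new `index` column, which is why that column appears in the output above. If the old labels aren't needed, `reset_index(drop=True)` discards them. A sketch on a toy frame (the titles are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({"obit_title": ["doc-a", "doc-b", "doc-c"]}, index=[320, 48, 265])

kept = toy.reset_index()              # old labels become an 'index' column
dropped = toy.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
```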

In [None]:
meta['gender'].sum()

In [None]:
df_men = df_concat[df_concat['gender'] == 0]
df_women = df_concat[df_concat['gender'] == 1]
df_men = df_men.sample(n=93)

In [None]:
df_final = pd.concat([df_men, df_women])
df_final


In [None]:
# reset the index, discarding the old row labels instead of keeping them as an 'index' column
df_final = df_final.reset_index(drop=True)
df_final

### Match meta and data dataframes with subset of df_final

We'll continue to use meta and df, so we need to ensure they match our subsetted df_final.

In [None]:
meta = df_final[["obit_title", "gender", "date"]].copy() # copy so we can add columns later without a SettingWithCopyWarning
meta

In [None]:
df = df_final.loc[:,'000':]
df

## Let's run our classifier!

Once we have a dataframe with metadata and vocab counts, we're ready to run our classifier!

### We add columns for probabilities and predicted class to our metadata

As we run the model, we are going to store its output with our metadata. This will allow us to easily examine the model's output.

In [None]:
meta['PROBS'] = ''
meta['PREDICTED'] = ''

### Load model

We will use scikit-learn's `LogisticRegression` model. There are many other options for classifier models. Some are better for some tasks, others for others. LogisticRegression is standard for classifying literature. We set the penalty as l1 and the 'C' value as 1.0. If you decide to specialize in classification, you can explore further the implications of these arguments.
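
As a rough intuition for those two arguments: the l1 penalty tends to push the coefficients of unhelpful features to exactly zero, and `C` is the inverse of the penalty strength, so a smaller `C` means a stronger penalty. The sketch below uses synthetic data from scikit-learn's `make_classification` (not our obituaries), and the two `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic data: 100 samples, 20 features, only 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=100, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

def count_zero_coefs(C):
    """Fit an l1-penalized model and count coefficients that are exactly zero."""
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X_toy, y_toy)
    return int(np.sum(clf.coef_ == 0))

zeros_strong_penalty = count_zero_coefs(0.01)   # small C, strong penalty
zeros_weak_penalty = count_zero_coefs(100.0)    # large C, weak penalty
print(zeros_strong_penalty, zeros_weak_penalty)
```

With the stronger penalty, more coefficients should end up at zero, which is one reason an l1 model can be easier to interpret when there are thousands of word columns.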

In [None]:
model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')

### Run the model!

We run the model in the following for-loop.

Classification models need classes: they need the texts grouped into different sets. Our metadata has built-in classes: gender. Men are stored as 0; women as 1. We could, if we wanted, create a new 0/1 class based on year.

Each iteration trains on all the titles except one, then predicts which class the excluded title belongs to. We'll call this leave-one-out classification. There are other ways of dividing training and testing sets, which we won't explore today.

The first four indented lines simply track our progress by printing index, title, and class. The next four lines exclude a single title, and set the training data and the test data.

The final six lines fit the model, calculate the probabilities and predicted class of the test case, and add that information to our metadata dataframe.

In [None]:
for this_index in df_final.index.tolist():
    print(this_index) # keep track of where we are in the corpus
    title = meta.loc[this_index, 'obit_title']
    CLASS = meta.loc[this_index, 'gender']
    print(title, CLASS)

    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'gender'] # the y column tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate class probabilities of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case (a one-element array)
    meta.at[this_index, 'PREDICTED'] = predicted[0] # add predicted class to metadata as a single value
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata
    print('Class is: ' + str(CLASS) + '\n' + 'Prediction is: ' + str(predicted) + ' ' + str(prediction) + '\n')

How cool is this! For each obituary, we see who it's about, that person's gender (0 or 1), which gender the model thinks it's about, and with what probabilities.

What can you glean by glancing through?

ANSWER HERE
* *
* *
* *
* *
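
As an aside, the leave-one-out loop above can be written more compactly with scikit-learn's `LeaveOneOut` splitter and `cross_val_predict`. This is a sketch on synthetic data from `make_classification`, not a replacement for the notebook's loop; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# synthetic stand-in for the word counts (X) and the 0/1 classes (y)
X_toy, y_toy = make_classification(
    n_samples=40, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

toy_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# each row is predicted by a model trained on all the other rows
loo_probs = cross_val_predict(
    toy_model, X_toy, y_toy, cv=LeaveOneOut(), method="predict_proba"
)
loo_predicted = loo_probs.argmax(axis=1)  # column 0 is class 0, column 1 is class 1
loo_accuracy = float((loo_predicted == y_toy).mean())

print(loo_probs[:3])
print(loo_accuracy)
```

The explicit loop is still useful for teaching, since it lets us print each title as we go.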

## Results

Remember, we've stored our results in our metadata dataframe. Let's take a look!

In [None]:
meta

There's lots to look at here. We could explore probabilities: which obituaries is the model most sure about? Which are closest to 50-50? Which does it get most right and most wrong? Is there a pattern to misclassified obituaries?
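
One way to start on those questions is sketched below. The notebook stores probabilities as strings in PROBS, so this sketch assumes instead a numeric column, here called `prob_woman` (a hypothetical name), holding the model's probability for class 1. The titles and numbers are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "obit_title": ["doc-a", "doc-b", "doc-c", "doc-d"],
    "gender": [0, 1, 1, 0],
    "prob_woman": [0.10, 0.95, 0.48, 0.70],
})

# distance from 0.5: small means the model was unsure
toy["confidence"] = (toy["prob_woman"] - 0.5).abs()
least_sure_title = toy.sort_values("confidence").iloc[0]["obit_title"]

# predicted class and the rows the model got wrong
toy["PREDICTED"] = (toy["prob_woman"] >= 0.5).astype(int)
misclassified = toy[toy["PREDICTED"] != toy["gender"]]

print(least_sure_title)
print(misclassified["obit_title"].tolist())
```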

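For now, we just want to calculate its accuracy. First, let's make sure the PREDICTED column holds plain integers rather than bracketed one-element arrays like [0] or [1].

In [None]:
# unwrap any one-element array into a plain integer; plain integers pass through unchanged
meta['PREDICTED'] = meta['PREDICTED'].apply(lambda p: int(np.ravel(p)[0]))
meta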

### Result column

Now we can add a 'RESULT' column that is the result of subtracting the predicted gender from the actual gender.

* 0 means the model was correct.
* -1 means the model mistook a man for a woman.
* 1 means the model mistook a woman for a man.
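
A worked example of the subtraction on four made-up rows, covering each possible outcome:

```python
import pandas as pd

toy = pd.DataFrame({
    "gender":    [0, 0, 1, 1],   # actual class
    "PREDICTED": [0, 1, 1, 0],   # model's guess
})

# actual minus predicted: 0 = correct, -1 = man taken for woman, 1 = woman taken for man
toy["RESULT"] = toy["gender"] - toy["PREDICTED"]
print(toy)
```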

In [None]:
meta['RESULT'] = meta['gender'] - meta['PREDICTED'] # actual minus predicted
meta

Let's look at the accurate guesses.

In [None]:
# store the correct guesses under a new name so meta still holds every row for the total
correct = meta[meta['RESULT'] == 0]
correct

## Accuracy

How many did the model get correct?

We can calculate its accuracy by dividing the correct number by the total.
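
On toy data, that division might look like the sketch below (the values are invented). It also shows that scikit-learn's `accuracy_score` gives the same number.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

toy = pd.DataFrame({
    "gender":    [0, 0, 1, 1],
    "PREDICTED": [0, 1, 1, 1],
})

# correct guesses divided by all guesses
n_correct = int((toy["gender"] == toy["PREDICTED"]).sum())
toy_accuracy = n_correct / len(toy)

# the same calculation via scikit-learn
sk_accuracy = accuracy_score(toy["gender"], toy["PREDICTED"])

print(toy_accuracy, sk_accuracy)
```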

In [None]:
# divide here

Pretty good rate! Because our two classes are the same size, a model guessing at random would be right about 50% of the time. Ours does **much** better than that!

In the next lesson, we'll explore _how_ the model made its calculations by learning which words matter.

## BONUS

If you want to explore further, divide the obituaries by year rather than gender. You'll need to do some data cleaning and some manipulation of pandas to make it work. Can you figure it out?
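
As a starting hint, here is one possible way to turn a year column into a 0/1 class. It is a sketch only: the cutoff year is an arbitrary assumption, the new column name `period` is hypothetical, and the last row is a made-up entry with a missing date to show the cleaning step.

```python
import pandas as pd

toy_meta = pd.DataFrame({
    "obit_title": ["1885-Ulysses-Grant", "1945-Adolf-Hitler", "1997-Deng-Xiaoping", "no-date-example"],
    "date": [1885.0, 1945.0, 1997.0, None],
})

cutoff_year = 1950  # hypothetical dividing line; pick one that balances your classes

clean = toy_meta.dropna(subset=["date"]).copy()   # drop rows with no year
clean["date"] = clean["date"].astype(int)          # 1945.0 becomes 1945
clean["period"] = (clean["date"] >= cutoff_year).astype(int)  # 0 = earlier, 1 = later

print(clean)
```

Any rows dropped here would also need to be dropped from the word-count dataframe so the two stay aligned.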

If you're feeling REALLY ambitious, you can use this notebook and the next one to run classification over your own data. All you need is a text corpus with binary metadata (that is, metadata that can be divided into two classes). Classification works best if you have at least around eighty texts of each class, so our lyrics corpus is too small (though you could try and see what happens).