# Gender in Norwegian novels

* How to find the distribution of gender in novels
* Are females more likely to be referred to again than males?


In [None]:
# uncomment the line below (remove #) and run it if gender-guesser is not installed
#!pip install gender-guesser

In [None]:
import gender_guesser.detector as gender
import dhlab.module_update as mu
import dhlab.nbtext as nb
import requests
import pandas as pd
import json
from collections import Counter
mu.update("wordbank")
import wordbank as wb
mu.css()

In [None]:
detect = gender.Detector()

## Build a corpus using metadata

Search using author, Dewey classification, subject, translation, etc.

In [None]:
nb.book_corpus(author="knaus%karl%")

### Look up metadata for `2012112638153`

In [None]:
nb.metadata(2012112638153)

### Collect the frequency for this book

In [None]:
book = nb.frame(nb.get_freq(2012112638153, top=0, cutoff=0))
book.head(20)

### Initial gender distribution with pronouns

Here with the subject forms `han` (he) and `hun` (she), which are the most frequent.

In [None]:
book.loc[['han', 'hun']]

### Find words with capital letters

Heuristics for a name candidate:

1. Starts with a capital letter
1. Only first letter is capital
1. It won't occur without a capital letter

In [None]:
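capitals = [x for x in book.index
            if x.upper()[0] == x[0]           # starts with a capital letter
            and x.upper() != x                # is not written entirely in capitals
            and x.isalpha()                   # contains letters only
            and x.lower() not in book.index]  # never occurs in lowercase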
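
The three heuristics can also be tried out on a small, self-contained example. The sketch below uses a made-up frequency dictionary instead of `book`, and the helper name `is_name_candidate` is hypothetical. It is slightly stricter than the cell above, since it requires every letter after the first to be lowercase.

```python
# Toy frequency list; the words and counts are made up for illustration.
freq = {"Karl": 12, "Ove": 9, "Bjørn": 3, "bjørn": 1, "Han": 40, "han": 300,
        "NRK": 2, "Oslo": 5, "2012": 1, "huset": 20}


def is_name_candidate(word, vocabulary):
    """Return True if `word` passes the three heuristics for a name candidate."""
    return (
        word.isalpha()                      # letters only, no digits or punctuation
        and word[0].isupper()               # starts with a capital letter
        and word[1:].islower()              # only the first letter is capital
        and word.lower() not in vocabulary  # never occurs in lowercase
    )


candidates = [w for w in freq if is_name_candidate(w, freq)]
print(candidates)
```

Place names such as `Oslo` pass the test as well, so the candidate list still has to be filtered, which is what the gender lookup further down does.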

### Take a quick look at wordbank

In [None]:
wb.word_form_many(capitals)

In [None]:
wb.word_form_many(['Ask', "Per", "Lars", "Bjørn", "bjørn"])

### Collect gender data for words in the book

In [None]:
gender_data = [(c, detect.get_gender(c)) for c in capitals]

In [None]:
gf = pd.DataFrame(gender_data, columns = ['name', 'gender']).set_index('name')
gf.head(30)
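
Besides `male` and `female`, gender-guesser can return `mostly_male`, `mostly_female`, `andy` (androgynous) and `unknown`. The cells below only count the two clear-cut labels. As a minimal sketch, assuming one wants to fold the softer labels into coarse groups, the mapping could look like this. The grouping and the sample pairs are illustrative assumptions, not output from the detector.

```python
from collections import Counter

# Labels gender-guesser can return, mapped to coarse groups.
# The grouping itself is an assumption made for this sketch.
COARSE = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
    "andy": "undecided",     # androgynous name
    "unknown": "undecided",  # name not found in the database
}


def coarse_gender(label):
    """Map a detector label to 'male', 'female' or 'undecided'."""
    return COARSE.get(label, "undecided")


# Hypothetical (name, label) pairs in the same shape as gender_data
sample = [("A", "male"), ("B", "female"), ("C", "andy"),
          ("D", "unknown"), ("E", "mostly_female")]

counts = Counter(coarse_gender(label) for _, label in sample)
print(counts)
```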

### Count the different females and males

In [None]:
gf[gf['gender'] == 'female'].count()

In [None]:
gf[gf['gender'] == 'male'].count()

### Find names

In [None]:
gf[gf['gender'] == 'male'].head(20)

### Fetch some numbers

Look up the pronoun counts and store them in variables.

In [None]:
book.loc[['han', 'hun', 'jeg']]

In [None]:
han = book.loc['han']
hun = book.loc['hun']
jeg = book.loc['jeg']

### Count the occurrences

Each name occurs a number of times. Here we count how often the different names occur. First, let's look up the name candidates in the book's frequency list. Note a possible source of error: a character's first name and last name may both be counted, so one mention can count double.

In [None]:
book.loc[gf.index].head(20)

### Sum up males and females

In [None]:
males = book.loc[gf[gf['gender'] == 'male'].index].sum()
males

In [None]:
females = book.loc[gf[gf['gender'] == 'female'].index].sum()
females
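
The summation can be illustrated without pandas. The sketch below is not the notebook's method but a plain-Python equivalent under made-up data: it adds up the frequency of every name per gender label.

```python
from collections import Counter

# Hypothetical frequencies and gender labels; not taken from a real book.
freq = {"Karl": 12, "Ove": 9, "Linda": 7, "Oslo": 5}
gender = {"Karl": "male", "Ove": "male", "Linda": "female", "Oslo": "unknown"}

# Sum the occurrences of all names that share a label
totals = Counter()
for name, label in gender.items():
    totals[label] += freq.get(name, 0)

print(totals["male"], totals["female"])
```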

### Compare with the pronouns

In [None]:
males/females

In [None]:
han/hun

### Greater chance of referring to males than females

In [None]:
han/males

In [None]:
hun/females
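
To make the four ratios easier to read side by side, here is a small worked example with made-up counts. The numbers and the helper `ratio` are hypothetical. The last two values say how many pronouns occur per name mention, a rough indicator of how often a character of that gender is referred to again.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


# Hypothetical counts, not taken from any real book
counts = {"han": 300, "hun": 100, "males": 150, "females": 40}

name_ratio = ratio(counts["males"], counts["females"])    # male vs female name mentions
pronoun_ratio = ratio(counts["han"], counts["hun"])       # 'han' vs 'hun'
han_per_male = ratio(counts["han"], counts["males"])      # pronouns per male name mention
hun_per_female = ratio(counts["hun"], counts["females"])  # pronouns per female name mention

print(name_ratio, pronoun_ratio, han_per_male, hun_per_female)
```

With these toy numbers, female names are followed by more pronouns per mention (2.5) than male names (2.0), even though males dominate both the name and the pronoun counts.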

## Exercise

Change the metadata and choose a different book