Current Data Set: use the 2020-05-18-10-06 version of the train/test data
==================

This notebook (nb) collects notes on exploring the relationship between binary gender labels {female, male} and the genre labels (pop, rock, country, hip-hop, etc.) used by Wikipedia.

### Research question: Is there a gender bias in the music artist Wikipedia genre labels?

We approach this question through three types of calculations:

1. How does the male/female split depend on the following factors?
    - [ ] list size
    - [ ] each individual genre label
    - [ ] genre combinations
2. Can we build a model that can predict gender from the genre labels alone?
    - [ ] What is an upper bound for the sample that we have?
    - [ ] What are predictive features?
    
3. Do word embeddings exhibit clustering that is biased with respect to gender?
    - [ ] Create word embeddings, define clusters, calculate female/male percentages for each cluster
    - [ ] create word embeddings using prediction task
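
A minimal sketch of the first calculation (the female/male split by list size and by individual genre label). The record layout `(gender, genre_list)` and the toy data are assumptions for illustration, not the actual train/test format.

```python
from collections import Counter

# Hypothetical toy records: (gender, list of genre labels). Not real data.
artists = [
    ("female", ["pop", "r_and_b"]),
    ("male", ["rock"]),
    ("male", ["rock", "indie_rock", "pop"]),
    ("female", ["country"]),
    ("male", ["hip_hop", "pop"]),
]


def female_fraction_by(key_fn, records):
    """Return {key: fraction of female artists} for the keys produced by key_fn."""
    totals, females = Counter(), Counter()
    for gender, genres in records:
        for key in key_fn(genres):
            totals[key] += 1
            if gender == "female":
                females[key] += 1
    return {key: females[key] / totals[key] for key in totals}


# Split by genre list size: each artist contributes one key, len(genres).
by_list_size = female_fraction_by(lambda genres: [len(genres)], artists)

# Split by individual label: each artist contributes once per distinct label.
by_genre = female_fraction_by(lambda genres: set(genres), artists)
```

Genre combinations could be handled the same way by passing a `key_fn` that yields label pairs instead of single labels.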

Reports:

 - basic stats
     - [ ] compare the observed gender bias for each genre with the statistical fluctuations expected under an unbiased model
 - twobins
 - report on model limits
 - network graphs: communities and gender
 - word embeddings: communities and gender
 - f_bias and m_bias lists
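
One possible reading of the basic-stats comparison against an unbiased model is a binomial null: if gender were independent of genre, the number of female artists among the n artists carrying a label would be Binomial(n, p), with p the overall female fraction. The sketch below computes a z-score under that assumption; the function name and the example numbers are hypothetical.

```python
import math


def bias_z_score(n_genre, n_female_genre, p_female_overall):
    """z-score of the observed female count under a Binomial(n, p) null.

    Assumes 0 < p_female_overall < 1 and n_genre > 0.
    """
    expected = n_genre * p_female_overall
    std = math.sqrt(n_genre * p_female_overall * (1 - p_female_overall))
    return (n_female_genre - expected) / std


# Made-up numbers: 100 artists carry the label, 10 of them female,
# while 30% of all artists in the sample are female.
z = bias_z_score(100, 10, 0.3)
```

A large negative z suggests fewer female artists than chance fluctuations would explain. The normal approximation is poor for rare labels, so small-n genres would need an exact binomial test instead.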
 

Basic info on the data:

* There are X artists; X female, X male.
* There are X unique genre labels.
* The mean number of labels is X for female and X for male artists.
* Density histograms show that the distribution of the number of labels per artist is fairly similar for male and female artists (check this)

Sections for Final Report:

- overview
- data set: source, scrape, clean, verify, collapse genres
- results
    - basic stats
    - modeling
    - networks
- discussion/conclusion



### Data Pipeline:

This needs to be elaborated.

- use tf_genre conda environment for tensorflow
- genre_lists.txt is generated from the X_train data in genre_convert_genre_data_to_single_line_docs.ipynb

### For gender verification of 1% sample. 

Put agency primarily on the musicians themselves. If 1. is inconclusive, use 2., then mark verified. Else, mark unverified.

1. Musician social media, bandcamp, or website.
2. Wikipedia
3. Press

Tom and Dan each verified 0.5%. Tom found 100% accuracy; Dan found 76/77 correct genders. The one incorrect label assigned male to a band of multiple people; all bands were subsequently dropped.

### Things to do:

Tom: 

- [ ] Tom: contact Horvat
- [ ] Tom: associated acts networks

Dan: 

- [x] Dan: Create cleaned data set from Wikipedia scrape

### Begin Analyzing

- [ ] Dan: redo analyses with outliers removed
- [ ] use genre gender bias and network centrality together
- [ ] input: genre; output: genres it occurs with in lists of a size - make web app
- [x] Dan: make train-test split
- [ ] Basic Stats:
    - [x] Dan: what are most common genre labels in data
    - [ ] Dan: remove genre/artist pairs for which the genre label appears only for that artist and that artist only has one genre label?
    - [ ] Dan: tag cloud?
    - [x] which genres are most likely to be present in large genre lists?
        - [x] for heat map, use relative measure
        - [ ] use cardinality of >= n
    - [ ] correlation matrix: x,y is number of times that x and y appear in a list; absolute and relative
    - [ ] 
- [ ] groupings from word embeddings
    - [ ] apply k-means on top of word2vec and other vector representations
    - [ ] how do the clusters above correlate with gender?
    - [ ] transformations between clusters? subtract gender from male dominated genre? find gender analogous
    - [ ] evaluate the difference between pretrained word embeddings and those learned with our tasks
- [ ] Dan: graph to visualize genre-promiscuous artists using colored clusters: 
    - x-axis is number of times a genre label appears (ordered bins)
    - y-axis sample distribution of size of genre label bags in which it appears (use heat map)
- [ ] Dan: which genres are present in genre-promiscuous artists?
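
A rough sketch of the co-occurrence matrix item above (number of times x and y appear in the same list, absolute and relative). The notes do not fix a relative measure; the Jaccard-style ratio used here is an assumption, and the toy lists are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical genre lists, one per artist. Not real data.
genre_lists = [
    ["pop", "rock"],
    ["rock", "indie_rock", "pop"],
    ["rock"],
    ["country", "pop"],
]


def cooccurrence_counts(lists):
    """Count how many lists contain each label and each unordered label pair."""
    singles, pairs = Counter(), Counter()
    for labels in lists:
        unique = sorted(set(labels))  # sorted so each pair has one canonical key
        singles.update(unique)
        pairs.update(combinations(unique, 2))
    return singles, pairs


def relative_cooccurrence(singles, pairs):
    """Pair count divided by the number of lists containing either label."""
    return {
        (a, b): n / (singles[a] + singles[b] - n)
        for (a, b), n in pairs.items()
    }


singles, pairs = cooccurrence_counts(genre_lists)
relative = relative_cooccurrence(singles, pairs)
```

The same `pairs` counter could also feed the web app query (input: genre; output: the genres it occurs with), optionally restricted to lists of a given size before counting.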

### Modeling

- [x] bayesian
- [x] logistic regression
 - [ ] standard with L_1 (when N <30p want to try L_1, otherwise statistics aren't good)
 - [ ] use pairs of labels with L_1 regularization
- [ ] use word embedding/clustering to define relatedness of genres: define omnivorous index
- [ ] for linear model, expect single label appearances to not give much weight to features in model
- [ ] create streamlit interface so that Tom can run queries (practice for Dan!)
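
A small sketch of how the "pairs of labels" features could be built before fitting an L_1-regularized model. The `a+b` naming of pair features is a hypothetical convention, not something fixed in these notes.

```python
from itertools import combinations


def label_and_pair_features(genres):
    """Return single-label features followed by one feature per unordered pair."""
    labels = sorted(set(genres))
    pair_features = [f"{a}+{b}" for a, b in combinations(labels, 2)]
    return labels + pair_features


features = label_and_pair_features(["rock", "pop", "indie_rock"])
```

The number of pair features grows quadratically with the number of labels, which is one reason a sparsity-inducing penalty such as L_1 is attractive here. The resulting feature lists could then be one-hot encoded and passed to, for example, scikit-learn's `LogisticRegression` with `penalty="l1"` and a solver that supports it.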


### Web App
- [x] I like the overview information upfront. Could include the total number of genre tags beneath the "1494 unique genre labels".  Could also add the mean/median # of tags per musician there.
- [x] Is it possible to have the selection genre for the Co-Occurrences to be ordered alphabetically rather than in order of frequency?
- [x] select a genre, get the list of artists with that genre
- [x] select an artist in the corpus, see their genres.
- [ ] add filter/range options maybe; so return only say co-occurrences n>5 or something.
- [ ] stats from different models
- [ ] visuals to show how gender and genre omnivorousness relate

### Later: 
- [ ] Dan: find out input data structure for neural net to analyze graph
- [ ] carry out analysis for other platforms: spotify, tidal 
- [ ] __How does genre label interact with race?__

### Notes on change to genre label spelling

Word2Vec doesn't deal with '&' well, so all '&' are being changed to "\_and_" even though 'r&b' or "r 'n b" is the term in use, not "r and b".
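
A minimal sketch of the replacement described above. The function name is hypothetical, and absorbing whitespace around the '&' (so that 'rock & roll' does not end up with stray spaces) is an assumption about the cleaning step, not a confirmed detail.

```python
import re


def replace_ampersand(label):
    """Replace each '&', with any surrounding whitespace, by '_and_'."""
    return re.sub(r"\s*&\s*", "_and_", label)


cleaned = [replace_ampersand(label) for label in ["r&b", "rock & roll", "pop"]]
```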

### Notes from discussion 2020-04-29

- in methodology, describe scraping: 'stoner-prog' became 'stoner rock' + 'progressive rock' because of how the html was structured; can we count the number of times this happened? 

### Notes from discussion 2020-04-20

- [ ] ask Horvat about using Spotify data?

Revised research plan:

- Does Wikipedia encode gender and genre differently than Spotify/Horvat? Is the gender encoding robust across different databases?

- Scraper is working on Wikipedia

Networks

- find clusters and sparse connections -- who are the sparse connectors?
- which way do network edges point:
    - if two dense clusters are connected by a few directed edges,
    - through which artists are the connections made and which way do they point?
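
A sketch of the edge-direction question, assuming cluster assignments are already available from some community-detection step that is not specified here. The edge list, node names, and cluster labels are hypothetical.

```python
from collections import Counter

# Hypothetical directed edges (source artist -> target artist).
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a3", "b1"),
    ("b1", "b2"), ("b2", "a1"),
]
# Hypothetical cluster assignment for each artist.
cluster_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}


def cross_cluster_edges(edges, cluster_of):
    """Return edges that bridge two clusters and a count per direction."""
    bridges = [(s, t) for s, t in edges if cluster_of[s] != cluster_of[t]]
    direction_counts = Counter((cluster_of[s], cluster_of[t]) for s, t in bridges)
    return bridges, direction_counts


bridges, direction_counts = cross_cluster_edges(edges, cluster_of)
```

The endpoints of `bridges` are candidate sparse connectors, and `direction_counts` shows whether links run mostly from one cluster to the other.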

### Notes from discussion on 2020-04-09

Tom's script: searches for the name, then returns the first result; this sometimes gives the wrong band

problems with data: 

may not return correct band info

look at >5 genre tag bins, which genre labels are in there? rock, indie? 
Can we correlate genres in high cardinality with male dominated labels?

ask Horvat for data?

intergenre promiscuity is higher for male dominated genre; 
link to omnivorous listeners

can use network analysis to show that the genre promiscuity is artificially imposed/ superfluous : all links could be made by rock/indie rather than subgenres

have to separate genre tags and recommendation system

is there directionality in genre labels: Caribbean + rock -> rock and not to Caribbean; rock attracts Caribbean, but 

could test experimentally by creating listeners?

Book: Spotify Teardown

Goals and Steps

- clean data, verify data, get more data
- random sample of 10%, give tom half and we will verify by hand
- google for more artist gender data
- reach out to Horvat? have our methodology worked out first
- scrape incidence matrix for graph of artist network  

Gephi for graph analysis

return to intersection of graphs of two artists later


Three questions:

1. numbers of genre labels for gender
2. predict gender from genre
3. predict gender from network graph

How do we move beyond the musicological knowledge, or how are these biases embedded in the algorithms, algorithms not neutral...

Are genders lumped together?