<a href="https://colab.research.google.com/github/daniellekurtin/CodingAnOpenScienceSong/blob/main/NLP_Workshop_to_Code_and_Open_Science_Song.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Hack Consortium Workshop 1:
## Using topics generated from Natural Language Processing to inspire lyrics for an Open Science Song


**Objective**
You will collaborate with your group to work through this Google Colab notebook, where you will use Python to help generate lyrics for a song about open science. A mentor who is familiar with this workshop will be with you to help answer your questions.

**Part 1:**
We will introduce NLP: what it does, how it works, and how we're going to use it today. 

**Part 2:**
We will introduce the text we're using to generate inspiration for the lyrics for an Open Science Song. We are using a compiled body of n=? papers, studies, and editorials on open and reproducible research. These papers can all be found at this link:

**Part 3:**
We will learn about data cleaning and preprocessing for NLP, including text tokenization, text normalization, stemming, and part-of-speech tagging.

**Part 4:**
We will use topic modelling to extract topics from our text corpus. 

**Intermission:** We will take a __ minute break for lunch.

**Part 5:**
We will break into different groups to code an open science song together. Each breakout room will have a mentor who will help you use the topics generated from the NLP to craft part of an open science song.

**Part 6:**
Something social, if we'd like

**References :**
The 1-? parts of this workshop are credited to


The ?-? parts of this workshop are credited to Maria Balaet: 

*This event was organized by: Danielle Kurtin, Maria Balaet, Yorguin Mantilla Ramos,  Zhaoying Yu, Roonak Rezvani, Daniel Brady, Violeta Menendez Gonzalez*




If you want to mount your Google Drive in this Colab notebook, run the code below and follow the prompts.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Importing packages

Packages are one of the main reasons Python is an incredible language: we can pull in packages that perform specific functions. Today we'll use NLTK, the Natural Language Toolkit. To learn more about NLTK, please check out their [documentation](https://www.nltk.org/).

However, NLTK is not the only NLP package. Another, more advanced one is called spaCy. Feel free to check out their [documentation](https://spacy.io/), too!

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Tokenization

Tokenization breaks a string of text into semantically useful units called "tokens". It is the first step toward turning words into a mathematical format such as a "bag-of-words", where each token is treated as a data point.

Tokenization comes in two forms: sentence tokenization and word tokenization. Sentence tokenization splits a text into sentences, usually by identifying full stops (or periods, if you're used to American terminology!). Word tokenization often uses the spaces between words to separate them into tokens. However, more advanced tokenization functions use collocations to identify words that often go together, like "New" and "York".

Here is an example of first performing sentence tokenization on text, then word tokenization. 

"Sadhana is a baker. She enjoys making cakes."

Sentence tokenization:

"Sadhana is a baker.", "She enjoys making cakes."

Word tokenization:

"Sadhana" "is" "a" "baker" "She" "enjoys" "making" "cakes"

Tokenization is an important first step in NLP, because a lot of the following steps perform operations on individual words to regularize them. 
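To make the two steps concrete, here is a minimal sketch of sentence and word tokenization using only Python's built-in `re` module. The function names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this illustration, and the rules are deliberately naive; NLTK's tokenizers, used in the cells below, handle many more cases (for example abbreviations such as "Dr.").

```python
import re

def simple_sent_tokenize(text):
    # Split wherever ., ! or ? is followed by whitespace (naive rule).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    # Keep runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Sadhana is a baker. She enjoys making cakes."
sentences = simple_sent_tokenize(text)
words = [w for s in sentences for w in simple_word_tokenize(s)]

print(sentences)
print(words)
```

Running the sentence step first and the word step second mirrors the order used in the example above.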





In [None]:
#Tokenization -- Paragraphs into sentences;
from nltk.tokenize import sent_tokenize 
  
text = "Hello All. Welcome to medium. This article is about NLP using NLTK."
print("SENTENCE AS TOKENS:")
print(sent_tokenize(text))
print("No of Sentence Tokens:",len(sent_tokenize(text)))

SENTENCE AS TOKENS:
['Hello All.', 'Welcome to medium.', 'This article is about NLP using NLTK.']
No of Sentence Tokens: 3


In [None]:
import nltk.data 
  
german_tokenizer = nltk.data.load('tokenizers/punkt/PY3/german.pickle') 
  
text = 'Wie geht es Ihnen? Mir geht es gut.'
german_tokenizer.tokenize(text) 


['Wie geht es Ihnen?', 'Mir geht es gut.']

In [None]:
#Tokenization --Text into word tokens;
from nltk.tokenize import word_tokenize 
  
text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00. "
print("WORDS AS TOKENS:")
print(word_tokenize(text))
print("No of Word Tokens:",len(word_tokenize(text)))


WORDS AS TOKENS:
['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.', 'Subscribe', 'with', '$', '4.00', '.']
No of Word Tokens: 20


In [None]:
from nltk.tokenize import TreebankWordTokenizer 
text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00. "
tokenizer = TreebankWordTokenizer() 
tokenizer.tokenize(text) 

['Hello',
 'All.',
 'Welcome',
 'to',
 'medium.',
 'This',
 'article',
 'is',
 'about',
 'NLP',
 'using',
 'NLTK.',
 'Subscribe',
 'with',
 '$',
 '4.00',
 '.']

## n-grams 

n-grams are contiguous sequences of words in a sentence, where "n" specifies the number of words in each sequence. They help identify repeated phrases and their importance in the corpus. For example, in our corpus the words "open" and "science" will likely appear as the 2-gram "open science" or within the 4-gram "open and reproducible science".

Individual tokens, counted on their own (as in a bag-of-words), carry no information about which words appear next to each other, which is why n-grams come in handy.
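One way to see how n-grams surface repeated phrases is to count them. The sketch below uses only the standard library; `count_ngrams` is a hypothetical helper name and the sentence is a made-up example, not text from our corpus.

```python
from collections import Counter

def count_ngrams(tokens, n):
    # Slide a window of length n across the token list and count each window.
    windows = zip(*[tokens[i:] for i in range(n)])
    return Counter(" ".join(window) for window in windows)

# Made-up example sentence, already lowercased and free of punctuation
tokens = "open science needs open data and open science needs open code".split()

bigram_counts = count_ngrams(tokens, 2)
print(bigram_counts.most_common(3))
```

Phrases that occur more than once, such as "open science" here, rise to the top of the count.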

In [None]:
#Using pure python

import re

def generate_ngrams(text, n):
    # Convert to lowercase
    text = text.lower()
    
    # Replace all non-alphanumeric characters with spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    
    # Break the sentence into tokens and remove empty tokens
    tokens = [token for token in text.split(" ") if token != ""]
    
    # Use the zip function to line up each token with its n-1 successors,
    # then join each group of n tokens into a single n-gram string
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)
generate_ngrams(text, n=3)

Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.


['hello everyone welcome',
 'everyone welcome to',
 'welcome to intro',
 'to intro to',
 'intro to machine',
 'to machine learning',
 'machine learning applications',
 'learning applications we',
 'applications we are',
 'we are now',
 'are now learning',
 'now learning important',
 'learning important basics',
 'important basics of',
 'basics of nlp']

In [None]:
#Using NLTK import ngrams

import re
from nltk.util import ngrams

text = text.lower()
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]
output = list(ngrams(tokens, 3))
print(output)

[('hello', 'everyone', 'welcome'), ('everyone', 'welcome', 'to'), ('welcome', 'to', 'intro'), ('to', 'intro', 'to'), ('intro', 'to', 'machine'), ('to', 'machine', 'learning'), ('machine', 'learning', 'applications'), ('learning', 'applications', 'we'), ('applications', 'we', 'are'), ('we', 'are', 'now'), ('are', 'now', 'learning'), ('now', 'learning', 'important'), ('learning', 'important', 'basics'), ('important', 'basics', 'of'), ('basics', 'of', 'nlp')]


## Text normalization and stemming

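Text normalization converts text to a consistent form, for example by turning all letters in a token into either lowercase or uppercase, so that "Science" and "science" count as the same word.

Stemming removes affixes (prefixes and suffixes) from words to reduce them to a basic form, called the stem. For example, the word "running" splits into "run"+"ing", and the stem "run" is kept. Here are some more examples:

"noncompliant" --> "non"+"compliant"
"fixed" --> "fix"+"ed"
"transformation" --> "transform"+"ation"

Note that the Porter stemmer used below only strips suffixes, so it leaves prefixes such as "non" in place.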

You can see how this is better done on individual words than on an entire sentence, or entire body of text! 
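The splits listed above can be reproduced with a toy affix splitter. This is only a sketch under stated assumptions: the `PREFIXES` and `SUFFIXES` lists and the function `split_affixes` are invented for illustration, and the rules are far cruder than the Porter algorithm that NLTK implements.

```python
# Hand-picked affix lists for illustration only (an assumption, not NLTK's rules)
PREFIXES = ("non", "un")
SUFFIXES = ("ation", "ing", "ed", "s")

def split_affixes(word):
    """Return (prefix, stem, suffix) for a single lowercase word."""
    prefix, suffix = "", ""
    for p in PREFIXES:
        # Only strip a prefix if at least three letters remain
        if word.startswith(p) and len(word) - len(p) >= 3:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        # Only strip a suffix if at least three letters remain
        if word.endswith(s) and len(word) - len(s) >= 3:
            suffix, word = s, word[:-len(s)]
            break
    # Undo consonant doubling after a stripped suffix: "runn" -> "run"
    if suffix and len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return prefix, word, suffix

for w in ["running", "noncompliant", "fixed", "transformation"]:
    print(w, "-->", split_affixes(w))
```

A rule list this small will mis-split many real words (for example, it would treat the "un" in "universe" as a prefix), which is one reason to rely on a tested stemmer in practice.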


In [None]:
#Text Normalization

#Case Conversion
text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00."
lowert = text.lower()
uppert = text.upper()

print("To Lower Case:",lowert)
print("To Upper Case:",uppert)


To Lower Case: hello all. welcome to medium. this article is about nlp using nltk. subscribe with $4.00.
To Upper Case: HELLO ALL. WELCOME TO MEDIUM. THIS ARTICLE IS ABOUT NLP USING NLTK. SUBSCRIBE WITH $4.00.


In [None]:
#stemming
#Porter stemmer is a famous stemming approach
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer()
sentence = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

sent = word_tokenize(sentence)
print("After Word Tokenization:\n",sent)
print("Total No of Word Tokens: ",len(sent))

ps_sent = [ps.stem(words_sent) for words_sent in sent]
print(ps_sent)

After Word Tokenization:
 ['It', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files', 'when', 'those', 'same', 'people', 'are', "n't", 'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'schemes', '.', 'Ordinary', 'people', 'are', 'relentlessly', 'spied', 'on', ',', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', '.', 'While', 'I', "'d", 'like', 'to', 'see', 'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', ',', 'I', "'d", 'not', 'ask', 'for', 'it', 'until', 'there', "'s", 'reciprocity', '.']
Total No of Word Tokens:  69
['It', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'peopl', 'ceas', 'pirat', 'file', 'when', 'those', 'same', 'peopl', 'are', "n't", 'paid', 'for', 'their', 'particip', 'in', 'veri', 'lucr', 'network', 'scheme', '.', 'ordinari', 'peopl', 'are', 'relentlessli', 'spi', 'on', ',', 'and', 'not', 'compens', 'for', 'inform', 'taken', 'from', 'them', '.', 'whi

In [None]:
#Porter stemmer is a famous stemming approach

from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
ps = PorterStemmer() 
 
words = ["hike", "hikes", "hiked", "hiking", "hikers", "hiker", "universal", "universe", "university","alumnus", "alumni", "alumnae"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

hike  :  hike
hikes  :  hike
hiked  :  hike
hiking  :  hike
hikers  :  hiker
hiker  :  hiker
universal  :  univers
universe  :  univers
university  :  univers
alumnus  :  alumnu
alumni  :  alumni
alumnae  :  alumna


## Lemmatization

Lemmatization turns words, particularly verbs, into their dictionary form (the lemma). For example, "swim", "swam", and "swum" would all be converted to "swim".
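Unlike stemming, lemmatization relies on a dictionary and on the word's part of speech. NLTK's `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun, so verbs are only reduced when you pass `pos="v"`. The sketch below imitates that behaviour with a tiny hand-written lookup table; `LEMMAS` and `toy_lemmatize` are hypothetical names, and the entries are not taken from WordNet.

```python
# Toy lemma table keyed by (word, part of speech).
# Entries are hand-written for illustration and are NOT taken from WordNet.
LEMMAS = {
    ("swam", "v"): "swim",
    ("swum", "v"): "swim",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the table has no entry.
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("swam", pos="v"))    # looked up as a verb
print(toy_lemmatize("swam"))             # looked up as a noun: no entry, unchanged
print(toy_lemmatize("leaves"))           # noun reading
print(toy_lemmatize("leaves", pos="v"))  # verb reading
```

The same word can have different lemmas depending on its part of speech, which is why part-of-speech tagging is often run before lemmatization.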

In [None]:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')

lemma = WordNetLemmatizer()
text = "Live like a leaf. Leave the world as the leaves fall from the tree."
words = word_tokenize(text)
for w in words: 
    print(w, " : ", lemma.lemmatize(w)) 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Live  :  Live
like  :  like
a  :  a
leaf  :  leaf
.  :  .
Leave  :  Leave
the  :  the
world  :  world
as  :  a
the  :  the
leaves  :  leaf
fall  :  fall
from  :  from
the  :  the
tree  :  tree
.  :  .


In [None]:
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
import re
   
ps = PorterStemmer() 
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)


#Tokenize and stem the words
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]

# Stem each token
tokens = [ps.stem(token) for token in tokens]

#merge all the tokens to form a long text sequence 
text2 = ' '.join(tokens) 

print(text2)

Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.
hello everyon welcom to intro to machin learn applic We are now learn import basic of nlp


In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 
import re
   
ss = SnowballStemmer("english")
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)


#Tokenize and stem the words
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]

# Stem each token
tokens = [ss.stem(token) for token in tokens]

#merge all the tokens to form a long text sequence 
text2 = ' '.join(tokens) 

print(text2)

Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.
hello everyon welcom to intro to machin learn applic we are now learn import basic of nlp


## Stopword removal

Removing stopwords reduces the amount of "clutter" in a sentence. It removes very common words that carry little meaning on their own, such as "to", "then", "is", and "about".

In [None]:
#Stopwords removal 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

text = "Hello All. Welcome to medium. This article is about NLP using NLTK."

stop_words = set(stopwords.words('english')) 
word_tokens = word_tokenize(text) 
  
filtered_sentence = [] 
  
for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 
  
print(word_tokens) 
print(filtered_sentence) 

['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']
['Hello', 'All', '.', 'Welcome', 'medium', '.', 'This', 'article', 'NLP', 'using', 'NLTK', '.']
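In the output above, "This" and "All" survive the filter because the comparison is case-sensitive and NLTK's English stopword list is lowercase. A minimal sketch of a case-insensitive filter follows; it uses a tiny hand-written stopword set (`STOP_WORDS`, an assumption for illustration, not NLTK's full list) so that it runs without any downloads.

```python
# Tiny hand-written stopword set for illustration (not NLTK's full list)
STOP_WORDS = {"to", "this", "is", "about", "all"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Lowercase each token only for the comparison; keep the original spelling.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This',
          'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing the whole text before tokenizing, as in the normalization section, has the same effect.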


## Topic extraction 

So, we've gone through a lot of work just to get our corpus ready for topic modelling. We've tokenized, stemmed, lemmatized, and removed the stopwords from our text. Now we're ready to extract topics!

To do this, we'll use the following methods:

1. Implementation of LDA (Latent Dirichlet Allocation) as the topic modelling algorithm
2. Rating of top 20 sentences per topic
3. Interpretation of top opinions to label/name the topic

Import packages

In [None]:
## Import packages and read in data

from time import time
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split
import tmtoolkit
import numpy as np
import itertools
import logging
import pandas as pd
import seaborn as sns


data148=pd.read_pickle('data148_prepared_for_gensim')
d148=data148.R148_R_3_lemma.to_list()
data=d148


ModuleNotFoundError: ignored

In [None]:
#coherence of topics with no held out set
#Construct df

# Topics range
min_topics = 1
max_topics = 5
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

#data = LIST of cleaned text (no symbols, no numbers, not stopwords)
#dictionary = ??


In [None]:

classifier_results = {'Topics': [], "cv_Coherence_avg": [], "umass_Coherence_avg":[], "cuci_Coherence_avg":[], "cnpmi_Coherence_avg":[]}
    
# Create Dictionary
id2word = corpora.Dictionary(data)

# Create Corpus
texts = data

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
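The cell above turns each document into a bag-of-words: a list of `(token_id, count)` pairs. The sketch below shows the idea in plain Python. `build_vocab` and `to_bow` are hypothetical helpers, the two documents are made up, and gensim may assign different ids to the same words, so treat this as a picture of the data structure rather than gensim's implementation.

```python
from collections import Counter

def build_vocab(docs):
    # Give every distinct token an integer id, in order of first appearance.
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # Count how often each known token id occurs in one document.
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Two made-up, already-cleaned documents (lists of tokens)
docs = [["open", "science", "open", "data"],
        ["share", "data", "share", "code"]]

vocab = build_vocab(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]

print(vocab)
print(bow_corpus)
```

Word order is discarded at this point; only the counts per document are passed on to the topic model.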

In [None]:
for i in topics_range:
    # Seeding NumPy here did not make the runs reproducible for us; see https://github.com/RaRe-Technologies/gensim/issues/3218
    np.random.seed(100)
    n_topics = i
              
    t0 = time()

    #Model
    lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                            id2word=id2word,
                                                                num_topics=n_topics, 
                                                                random_state=100,
                                                                #chunksize=200,
                                                                #passes=100,
                                                                workers=20,
                                                                iterations=150,
                                                                minimum_probability=0)
                                                                 
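    # NOTE: `directory` is not defined anywhere in this notebook; set it to an output subfolder name before running.
    lda_model.save('test/'+directory+'lda_model_n_topics_'+str(i))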
        
    np.random.seed(100)

    cv_coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
    cv_coherence_total = cv_coherence_model_lda.get_coherence()

    ## We are using CV coherence due to our personal research experience, but there are other methods to calculate coherence. Have a play with the following three methods, and compare the outcome to those generated using cv coherence
    #umass_coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='u_mass')
    #umass_coherence_total = umass_coherence_model_lda.get_coherence()
            
    #cuci_coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_uci')
    #cuci_coherence_total = cuci_coherence_model_lda.get_coherence()
    #    
    #cnpmi_coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_npmi')
    #cnpmi_coherence_total = cnpmi_coherence_model_lda.get_coherence()
              
    classifier_results['Topics'].append(i)
    classifier_results['cv_Coherence_avg'].append(cv_coherence_total)
    # The three optional metrics above are commented out, so store NaN placeholders
    # to keep every column the same length. Swap in the real values if you uncomment them.
    classifier_results['umass_Coherence_avg'].append(np.nan)
    classifier_results['cuci_Coherence_avg'].append(np.nan)
    classifier_results['cnpmi_Coherence_avg'].append(np.nan)

In [None]:

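# NOTE: this cell reads like the body of `get_topics_gensim` (called in the last cell).
# Here `data` is a DataFrame, and `col`, `col_original`, `n_topics`, and `n_opinions` must be set before running.
np.random.seed(100)
opinions = data[col]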

# Create Dictionary
id2word = corpora.Dictionary(opinions)

# Create Corpus
texts = opinions

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

#model
lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=n_topics, 
                                                random_state=100,
                                                chunksize=200,
                                                passes=100,
                                                workers=20,
                                                iterations=150,
                                                minimum_probability=0)
    
#model_topics = lda_model.show_topics(formatted=True)
#print(model_topics)
    
#get topic distribution probabilities
all_topics = lda_model.get_document_topics(corpus, minimum_probability=0.0)
doc_topic_dist_proc = gensim.matutils.corpus2csc(all_topics)
doc_topic_dist_numpy = doc_topic_dist_proc.T.toarray()
#doc_topic_dist = pd.DataFrame(doc_topic_dist_numpy)

#get dominant topic
topicnames = ['Topic_' + str(i+1) for i in range(0,n_topics,1)]
all_topics_df = pd.DataFrame(doc_topic_dist_numpy,columns=topicnames)
all_topics_df['dominant_topic_contribution'] = all_topics_df.max(axis = 1) 
all_topics_df['dominant_topic'] = np.argmax(all_topics_df.values, axis=1)
all_topics_df['dominant_topic_name'] = "Topic "+(all_topics_df['dominant_topic']+1).astype(str)
    
#append df 
all_topics_df_full = all_topics_df.merge(data[['user_id',col_original]],left_index=True,right_index=True)
all_topics_df_full.to_csv('all_labelled_opinions_'+col_original+'_gensim_overfitted.csv')
sneak_peak = all_topics_df_full[[col_original]]
sneak_peak = sneak_peak.sample(frac=1).reset_index(drop=True)
sneak_peak.to_csv('quick_data_sneak_peak_'+col_original+'.csv')
    
#get the top 20 opinions per topic df and output it as csv
top_opinions_df = all_topics_df_full.groupby('dominant_topic_name').apply(lambda x: x.nlargest(n_opinions, 'dominant_topic_contribution')).reset_index(drop=True)
top_opinions_df.to_csv('labelled_validation_'+col_original+'_gensim_overfitted.csv')
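The "dominant topic" and "top opinions per topic" steps above can be easier to follow without the DataFrame machinery. Below is a minimal plain-Python sketch of the same two ideas. The helper names `dominant_topic` and `top_docs_per_topic` and the probabilities are hypothetical, chosen only to illustrate the logic.

```python
def dominant_topic(distribution):
    # Index of the largest topic probability, plus the probability itself.
    best = max(range(len(distribution)), key=lambda k: distribution[k])
    return best, distribution[best]

def top_docs_per_topic(doc_topic, n):
    # Group documents by dominant topic, then keep the n strongest per topic.
    by_topic = {}
    for doc_id, dist in enumerate(doc_topic):
        topic, weight = dominant_topic(dist)
        by_topic.setdefault(topic, []).append((weight, doc_id))
    return {topic: [doc_id for _, doc_id in sorted(pairs, reverse=True)[:n]]
            for topic, pairs in by_topic.items()}

# Hypothetical document-topic probabilities: 4 documents, 2 topics
doc_topic = [[0.7, 0.3],
             [0.2, 0.8],
             [0.9, 0.1],
             [0.6, 0.4]]

top = top_docs_per_topic(doc_topic, n=2)
print(top)
```

Reading the strongest documents for each topic is what lets a human give the topic a meaningful name.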


In [None]:

#output .csv for validation with TOP opinions, dominant topic name and column to insert validator opinion
# Work on a copy so that adding a column does not trigger pandas' SettingWithCopyWarning
validation_sheet1 = top_opinions_df[[col_original,'dominant_topic_name']].copy()
validation_sheet1['validation']='yes/no'
validation_sheet1=validation_sheet1.sample(frac=1).reset_index(drop=True) #this shuffles the order!!
validation_sheet1.to_csv('validation_top_opinions_'+col_original+'_gensim_overfitted.csv')

#output .csv for validation with RANDOM opinions, dominant topic name, and column to insert validator opinion
#get 20 random opinions per topic from all_topic_df_full

validation_sheet2 = []# {'opinion':[],'dominant_topic_name':[], 'validation':[]}
for topic in all_topics_df_full.dominant_topic_name.unique():
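    random_20 = all_topics_df_full[all_topics_df_full['dominant_topic_name']==topic]
    # Sample at most 20 rows so that topics with fewer documents do not raise an error
    random_20 = random_20.sample(n=min(20, len(random_20)))
    validation_sheet2.append(random_20)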
    #random_20 = pd.DataFrame(random_20,columns=all_topics_df_full.columns)
validation_sheet2 = pd.concat(validation_sheet2, ignore_index=True)
    #validation_sheet2 = pd.DataFrame(validation_sheet2,columns=all_topics_df.columns)

validation_sheet2 = validation_sheet2[[col_original,'dominant_topic_name']].copy()
validation_sheet2['validation']='yes/no'
validation_sheet2=validation_sheet2.sample(frac=1).reset_index(drop=True)
pd.DataFrame(validation_sheet2).to_csv('validation_random_opinions_'+col_original+'_gensim_overfitted.csv')

#print top words per topic (easier to make the validation for top words manual)

for idx, topic in lda_model.show_topics(formatted=False, num_words= 20):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))



In [None]:
result = overfitted_coh(data,'test/')
pd.DataFrame(result).to_csv('result_test.csv')
sns.lineplot(x='Topics', y='cv_Coherence_avg', data=pd.DataFrame(result))

In [None]:
doc_topic_dist = get_topics_gensim(data148,'R148_R_3_lemma','R148_R_3',7,20,20)