# Portfolio Assignment week 02

This week's focus is on manifold learning and text clustering. As part of the portfolio assignment, you are required to make a contribution to either the manifold learning case or the text clustering case. There are several options for your contribution, so you can choose the one that aligns with your learning style or interests the most

## Text clustering

Read, execute and analyse the code in the notebook tutorial_clustering_words. Then *choose one* of the assignments a), b) or c). 

c) Provide a text clustering solution with your own data of interest, you can follow a similar approach to the one in the tutorial_clustering_words notebook. 

Mind you that you are not allowed to copy code solutions without referencing.

### A. Get and clean the text

### Loading and getting data in the right form

In this assignment I will be working on a text clustering solution for articles analysing different methods used for differential gene expression analysis.

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.feature_extraction import text
import re
import string
import requests
from bs4 import BeautifulSoup
import glob
import pandas as pd
from pathlib import Path
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/anuk-k/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/anuk-k/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/anuk-k/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/anuk-k/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /home/anuk-k/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.feature_extraction import text
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
import re
import string
import glob
import pandas as pd
from pathlib import Path

In [55]:
def html_to_text(url_list):
    # Create empty list for storing article text
    articles = []
    
    for url in url_list:
        # Extracting html using given url
        response = requests.get(url)
        response.raise_for_status()
        
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Kill all script and style elements
        for script in soup(["script", "style"]):
            script.extract()
        
        # Extract article text and append to articles list
        text = soup.get_text()
        articles.append(text)
    
    return articles

In [56]:
# Providing urls for brain expression related articles:
article1 = {"RNA-Seq differential expression analysis: An extended review and a software tool":"https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190152"}
article2 = {"Exaggerated false positives by popular differential expression methods when analyzing human population samples": "https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4"}
article3 = {"A Comparative Study of Techniques for Differential Expression Analysis on RNA-Seq Data":"https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0103207"}
article4 = {"Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data":"https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-9-r95"}

article_list = [article1,article2,article3,article4]
url_list = []
title_list = []
for dict in article_list:
    for key, value in dict.items():
        title_list.append(key)
        url_list.append(value)
        
articles = html_to_text(url_list)

### Preprocessing and cleaning

To cluster text, the text needs to be preprocessed with a regular natural language processer. In this step punctuation, stopwords or other unwanted text elements are removed. 

In [57]:
def clean_text(text):
    '''
    Given a string from an article, removes new lines, capatilisation, punctuation, 
    references, and excess white space. Returns a string of cleaned text
    '''
    # removing new lines
    text = re.sub('\\n', ' ', text)
    # removing capitalised letters
    text = text.lower()
    # removing punctuation
    text = re.sub('\: |\? |\. |\, |\* |\\|\: | \- |\;|\:| \| | +', ' ', text)
    # removing references
    text = re.sub('\[\d+ |\,*\d*\]', ' ', text)
    text = re.sub('\(\d+\)', ' ', text)
    # removing brackets
    text = re.sub('\(|\)|\[|\]', '', text)
    # removing excess space
    text = re.sub('\s{2,}|\.{2,}', ' ', text)
    
    return text


In [58]:
# Noun extract and lemmatize function
def nouns(text):
    '''Given a string of text, tokenize the text 
    and pull out only the nouns.
    Author: Fenna Feenstra
    Date: 25-05-2023
    Source: https://github.com/fenna/BFVM23DATASCNC5/blob/main/Tutorials/tutorial_clustering_words.ipynb
    '''
    # create mask to isolate words that are nouns
    is_noun = lambda pos: pos[:2] == 'NN'
    # store function to split string of words 
    # into a list of words (tokens)
    tokenized = word_tokenize(text)
    # store function to lemmatize each word
    wordnet_lemmatizer = WordNetLemmatizer()
    # use list comprehension to lemmatize all words 
    # and create a list of all nouns
    all_nouns = [wordnet_lemmatizer.lemmatize(word) \
    for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    
    #return string of joined list of nouns
    return ' '.join(all_nouns)

In [59]:
clean_list = []
nouns_list = []

for text in articles:
    # Cleaning data and extracting nouns
    clean = clean_text(text)
    noun = nouns(clean)
    # Storing in list
    clean_list.append(clean)
    nouns_list.append(noun)
    
# Store in dataframe
data_nouns = pd.DataFrame(columns = ['title', 'text'])
df.title = title_list
df.text = nouns_list

In [60]:
df

Unnamed: 0,title,text,noun
0,RNA-Seq differential expression analysis: An e...,expression analysis review software tool skip ...,expression analysis review software tool skip ...
1,Exaggerated false positives by popular differe...,positive expression method population sample b...,positive expression method population sample b...
2,A Comparative Study of Techniques for Differen...,study technique expression analysis data skip ...,study technique expression analysis data skip ...
3,Comprehensive evaluation of differential gene ...,evaluation gene expression analysis method dat...,evaluation gene expression analysis method dat...


### Part B: The Document-Term Matrix (DTM)

To perform analyses we need to create a Document-Term Maxtrix. The Document-Term Matrix (DTM) represents the frequency of words (or terms) in a collection of documents. Each row in the matrix represents a document, and each column represents a word in the vocabulary. The value in each cell represents the frequency of the corresponding word in the corresponding document.


In [64]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

nltk.download('stopwords')

# Create a document-term matrix with only nouns
# Store TF-IDF Vectorizer
tv_noun = TfidfVectorizer(stop_words=text.ENGLISH_STOP_WORDS, ngram_range = (1,1), max_df = .8, min_df = .01)

# Fit and Transform speech noun text to a TF-IDF Doc-Term Matrix
#data_tv_noun = tv_noun.fit_transform(data_nouns.text)


[nltk_data] Downloading package stopwords to /home/anuk-k/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


AttributeError: 'str' object has no attribute 'ENGLISH_STOP_WORDS'

ValueError: empty vocabulary; perhaps the documents only contain stop words