# HW3

## Text Processing

### Q1

1. Modify the code I wrote in lecture 8 with what you have learnt in lecture 9 and correctly tokenize the text both on the word and sentence level, and by removing the stopwords. Rewrite the `getSummary` function and all the other functions that it depends by maing these corrections.

2. Rewrite the code I wrote for `getKeywords` function making the same corrections.

3. Test your code from parts 1 and 2 on random articles from the Guardian.

4. Rewrite the `getSubjectGuardian` function for another newspaper in English, and test your code from part 1 and 2 on random articles from this new newspaper.

<hr>

In [1]:
import requests
import nltk
import regex as re
import numpy as np

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

Q1-1) I modify the subfunction of getSummary, processText(), by using tokenization instead of regex for getting sentences and words:
- I get sentences and words with tokenization instead of using regex.
- Then, the next step is lowering the sentences and the words.
- By filtering the list, the whitespaces strings and duplicating strings are removed from the word list.

In [3]:
def processText(text):
    raw_sentences = sent_tokenize(text)
    raw_words = word_tokenize(text)
    
    res_text = {'sentences': raw_sentences,
                    'words': raw_words}
    
    #
    res_text.update({'cleanedSentences': [re.sub(r'[^\p{Letter}\s]','',sentence.lower()) for sentence in res_text['sentences']]})
    
    #getting nonempty words
    tmp = [re.sub(r'[^\p{Letter}]','',word.lower()) for word in res_text['words']]
    words_nonempty = list(filter(lambda x: x != '', tmp))
    #removing duplicating words
    words_removedduplicates = []
    [words_removedduplicates.append(x) for x in words_nonempty if x not in words_removedduplicates]
    res_text.update({'cleanedWords': words_removedduplicates})
    

    return res_text

I test tokenization and removing whitespaces if it is working or not.

In [4]:
text = 'My name is Mr. Smith. I have a Ph.D. from M.I.T. and I work at I.B.M. Now, look at pg. 12 of your text.'
s = processText(text)['cleanedSentences']
w = processText(text)['cleanedWords']
(s,w)

(['my name is mr smith',
  'i have a phd from mit',
  'and i work at ibm',
  'now look at pg',
  ' of your text'],
 ['my',
  'name',
  'is',
  'mr',
  'smith',
  'i',
  'have',
  'a',
  'phd',
  'from',
  'mit',
  'and',
  'work',
  'at',
  'ibm',
  'now',
  'look',
  'pg',
  'of',
  'your',
  'text'])

In [5]:
def getMatrix(sentences):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(sentences)

In [6]:
def getSummary(text,k):
    
    sentences = processText(text)['cleanedSentences']
    
    matrix = getMatrix(sentences)
    projection = PCA(n_components=1)
    weights = projection.fit_transform(matrix.toarray())
    res = list(zip(weights.transpose()[0],range(112),sentences))
    tmp = sorted(res,key=lambda x: x[0],reverse=True)[:k]
    return sorted(tmp, key=lambda x: x[1])

Q1-2) I rewrite the function getKeywords:
- I get the words not in stop words sw.

In [7]:
def getKeywords(text,sw,k):
    sentences = processText(text)['cleanedSentences']
    
    words = processText(text)['cleanedWords']
    words_notstop = [w for w in words if not w in sw]

    vectorizer = CountVectorizer(stop_words=sw)
    matrix = vectorizer.fit_transform(sentences)
    
    projection = PCA(n_components=1)
    tmp = projection.fit_transform(matrix.transpose().toarray())
    weights = tmp.transpose()[0]
    
    return sorted(zip(weights,words_notstop),key=lambda x: x[0], reverse=True)[:k]

I import the function parse for pulling the text of a link as a dictionary type.

In [8]:
from xmltodict import parse

Q1-3) For a article from The Guardian, I will test the functions that i wrote earlier, getSummary and getKeywords.

This function will be used for pulling the RSS feeds of The Guardian.

In [9]:
def getSubjectGuardian(subject):
    with requests.get(f'https://www.theguardian.com/{subject}/rss') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

With the use of BeautifulSoup for extracting contents for each tag that is 'p'.

In [10]:
from bs4 import BeautifulSoup

def getText(url):
    with requests.get(url) as link:
        tags = BeautifulSoup(link.content,'html.parser')
        
    return ' '.join([x.text for x in tags.find_all('p')])

We have random articles, one from The Guardian.

In [11]:
articles_theGuardian = getSubjectGuardian('sport')
n = len(articles_theGuardian)
k = np.random.randint(n)
article_theGuardian = getText(articles_theGuardian[k]['link'])
articletitle_theGuardian = articles_theGuardian[k]['title']

Q1-4) I overwrite the function for the RSS feeds of The Sun Daily. Just, the link was changed.

In [12]:
def getSubjectGuardian2(subject):
    with requests.get(f'https://www.thesundaily.my/rss/{subject}.xml') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

And we have random articles, one from The Sun.

In [13]:
articles_theSun = getSubjectGuardian2('Sport')
m = len(articles_theSun)
l = np.random.randint(m)
article_theSun = getText(articles_theSun[l]['link'])
articletitle_theSun = articles_theSun[l]['title']

** Note that, I also get the titles of the articles for better understanding if the results are relevant.

In [14]:
(articletitle_theGuardian,articletitle_theSun)

('Bradley Wiggins’s pain shows us that sport needs to prioritise welfare too',
 'City’s Fernandinho vows no let-up against Real Madrid')

The Guardian Section : 

In [15]:
getSummary(article_theGuardian,3)

[(8.508084523725085,
  9,
  'first that safe sport must be the first priority but for many it still isnt second that certain aspects of highperformance environments that are often admired  from behaviours of obsessive dedication placing winning above everything else and the power balance within coachathlete relationships  contribute to a darker side of sport that brings heavy longterm costs and third that our focus on heroic narratives throughout sport is at best misleading and at times deeply damaging distracting us from the real stories of the people behind the hero masks'),
 (4.469335379744962,
  17,
  'in grassroots clubs parents often brought up in a culture where having a medal brought status recognition and opportunity easily buy in to wanting that for their children not realising the costs or that there is so much more to gain through sport that would outlast trophies and coaches who know their jobs depend on shortterm results are disincentivised from investing in the longterm 

In [16]:
swEN = set(stopwords.words('english'))
getKeywords(article_theGuardian,swEN,15)

[(5.070245012128429, 'hold'),
 (1.267126281331729, 'braver'),
 (1.1556118227821273, 'holistic'),
 (1.1164323001408554, 'best'),
 (1.0952928906645596, 'pam'),
 (1.0952928906645596, 'foundation'),
 (1.0928430240538194, 'models'),
 (0.8787341921021443, 'paris'),
 (0.8682950551356036, 'ready'),
 (0.8600490449473461, 'whether'),
 (0.6368830264285261, 'defines'),
 (0.6110418129223201, 'shortterm'),
 (0.6063550641021692, 'broader'),
 (0.6046995151130318, 'order'),
 (0.5859961376714278, 'social')]

The Sun section :

In [17]:
getSummary(article_theSun,3)

[(-0.16576711418827936,
  1,
  'city held a twogoal lead three times at the etihad but real having come from behind in the previous rounds against paris saintgermain and chelsea responded yet again'),
 (10.315086599170332,
  7,
  'they have top players and in the meantime they can create chances and score goals so you have to be careful all the time the brazil international who came off the bench to play at rightback for the injured john stones in the first half said city were disappointed to concede three goals but the win would give them confidence for the return fixture at the bernabeu on may '),
 (0.31853313104560177,
  9,
  'fernandinho said it is the same as always  when we win we celebrate and then focus on the next game in the premier league')]

In [18]:
getKeywords(article_theSun,swEN,15)

[(1.5214518391540317, 'reason'),
 (1.1979363766939983, 'twogoal'),
 (0.9733000491303266, 'behind'),
 (0.9733000491303266, 'rounds'),
 (0.8163167337317945, 'always'),
 (0.7336309930084619, 'international'),
 (0.7122431233823232, 'return'),
 (0.7122431233823232, 'thinking'),
 (0.6889044989422013, 'madrid'),
 (0.6889044989422012, 'backyard'),
 (0.6889044989422012, 'first'),
 (0.6889044989422012, 'leg'),
 (0.6889044989422012, 'league'),
 (0.6889044989422012, 'previous'),
 (0.6889044989422012, 'responded')]

### Q2

Write a function that returns all named entities (proper names, country names, corporation names only) from a URL. Function should take the URL as the input and must return the list of named entities from that URL. Test your code on random articles from the Guardian. Don't use the NLTK's NER that I demonstrated during the lecture. Use the SpaCY's NER function.

I downloaded spacy with this code :
```import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en```

I have used spacy.explain(token.label_) for pulling Proper Names, Country Names and Corporation Names. Then, I decided to take the tokens whose label_ are PERSON, GPE or ORG.

In [19]:
import spacy

def SpacyNER(text):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    res = [ (token.text,token.label_,spacy.explain(token.label_)) for token in doc.ents if token.label_ in ['GPE','PERSON','ORG']]
    
    #sentences = sent_tokenize(text)
    #res = []
    #for i in range(len(sentences)):
    #    doc = nlp(sentences[i])
    #   res.append([ (token.text,token.label_,spacy.explain(token.label_)) for token in doc.ents])


    return res

Here is a test for a random article from The Guardian :

In [20]:
articles_theGuardian = getSubjectGuardian('sport')
n = len(articles_theGuardian)
k = np.random.randint(n)
article_theGuardian = getText(articles_theGuardian[k]['link'])

SpacyNER(article_theGuardian)

[('LinkedTok', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('Fiver', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('The Fiver', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('Ralf Rangnick', 'PERSON', 'People, including fictional'),
 ('Big Red', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('Old Trafford', 'PERSON', 'People, including fictional'),
 ('Thomas Frank', 'PERSON', 'People, including fictional'),
 ('Ole Gunnar Solskjær', 'PERSON', 'People, including fictional'),
 ('Premier League', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('Watford', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('Chelsea', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('Cristiano Ronaldo', 'PERSON', 'People, including fictional'),
 ('United’s', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('Ralfie', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('United', 'ORG', 'Companies, agencies, institutions, etc.'),
 ('United', 'ORG', 'Companies, agencies

### Q3

1. Write a function that returns the most positive and the most negative sentences from a text. The function must take the text as the input and must return a 2-tuple: the first element as the most positive and the second as the most negative sentence with their polarity scores.

2. Test your function on random articles from the Guardian.

In [21]:
from nltk.sentiment import SentimentIntensityAnalyzer
#nltk.download('vader_lexicon')

Q3-1) Variable res contains the sentences and their polarity scores. For sentences, I create 2 lists that hold their positive and negative scores. Finally, using max() on lists with respect to their scores gives me the most positive one and the most negative one.

In [22]:
def Sentiments(text):
    analyzer = SentimentIntensityAnalyzer()
    analyzer.polarity_scores(text)
    sentences = sent_tokenize(text)
    res = [(x,analyzer.polarity_scores(x)) for x in sentences]
    
    # Getting list of Negative Sentences and Positive Sentences
    list_neg = [ (sentence[0] , sentence[1]['neg']) for sentence in res]
    list_pos = [ (sentence[0] , sentence[1]['pos']) for sentence in res]
    
    # Getting most negative and most positive
    max_neg = max(list_neg,key=lambda x:x[1])
    max_pos = max(list_pos,key=lambda x:x[1])
    return ({'most positive':max_pos},
            {'most negative':max_neg})

Q3-2) I test my function :

In [31]:
articles_theGuardian = getSubjectGuardian('sport')
n = len(articles_theGuardian)
k = np.random.randint(n)
article_theGuardian = getText(articles_theGuardian[k]['link'])

Sentiments(article_theGuardian)

({'most positive': ('That is what inspiring the next generation surely means.',
   0.449)},
 {'most negative': ('Why the delay?', 0.535)})

We check whether it is true or false.

In [32]:
text = article_theGuardian
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(text)
sentences = sent_tokenize(text)
[(x,analyzer.polarity_scores(x)) for x in sentences]

[('Coaches and leaders must apply equal rigour to creating safe environments as they have long done to winning It’s hard to recall Bradley Wiggins sitting on his throne at Hampton Court, riding into Paris with the yellow jersey, or ringing the bell to start the 2012 London Olympics opening ceremony without now seeing through to the secret he was carrying inside him on all those occasions.',
  {'neg': 0.019, 'neu': 0.862, 'pos': 0.118, 'compound': 0.7964}),
 ('It took Wiggins nearly three decades to share his experience of grooming at the age of 13.',
  {'neg': 0.0, 'neu': 0.879, 'pos': 0.121, 'compound': 0.296}),
 ('It took Pam Shriver four decades to share her story of an emotionally abusive relationship with a coach.',
  {'neg': 0.196, 'neu': 0.701, 'pos': 0.103, 'compound': -0.4588}),
 ('What about the other stories that haven’t yet been heard?',
  {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}),
 ('While there is shock and sympathy, we must go further to draw out what needs 