## 410 Final Project: Generating Summaries for News Articles
Aaron Kuhstoss, Shalin Mehta, and Aleksandra Grigortsuk

### Imports

#### The cell below imports the libraries we use to load our data into a data frame for parsing (pandas) and to find the English articles (langdetect)

In [5]:
import pandas as pd
from langdetect import detect


In [6]:
df = pd.read_csv("Latest_News.csv")


#### In the initial data frame of the raw CSV data, the content of the articles is not in English for many rows, and some rows have NaN values. We remove the rows with NaN content below.

In [7]:
non_null_content = df[df['content'].notna()]


In [8]:
non_null_content.head()

Unnamed: 0,title,link,keywords,creator,video_url,description,content,pubDate,full_description,image_url,source_id
7,Nuovi massimi sui mercati azionari Usa mentre ...,https://www.doveinvestire.com/mercati-finanzia...,,['Dove Investire'],,Nuovi massimi sui mercati azionari Usa mentre ...,Nuovi massimi sui mercati azionari Usa mentre ...,2021-10-26 07:04:17,,,doveinvestire
8,บริษัทที่อยู่เบื้องหลัง &#8220;NBA Top Shot&#8...,https://siamblockchain.com/2021/10/26/pro-spor...,['26'],['Thongchai'],,Dapper Labs บริษัทที่อยู่เบื้องหลัง NBA Top Sh...,Dapper Labs บริษัทที่อยู่เบื้องหลัง NBA Top Sh...,2021-10-26 07:03:58,,https://siamblockchain.com/wp-content/uploads/...,siamblockchain
15,“พรพิมล”สมัครสมาชิก ภท.แล้ว”อนุทิน”ต้อนรับอบอุ่น,https://www.innnews.co.th/news/politics/news_2...,"['การเมือง', 'ข่าว', 'BreakingNews', 'INNNews'...",['Pavichaya Silpradit'],,"""พรพิมล""สมัครสมาชิก ภท.แล้ว ""อนุทิน"" ต้อนรับอบ...",“พรพิมล”สมัครสมาชิก ภท.แล้ว “อนุทิน” ต้อนรับอบ...,2021-10-26 07:02:09,,,innnews
16,ABP dumpt 15 miljard aan aandelen olie en gas,https://www.geenstijl.nl/5161756/goh-wij-dacht...,,['Ronaldo'],,,WAT GAAN WE DAARMEE DOEN? Een mooi bericht voo...,2021-10-26 07:02:00,,,geenstijl
18,Több településen megnőtt a légszennyezettség a...,https://444.hu/2021/10/26/tobb-telepulesen-meg...,"['Nemzeti Népegészségügyi Központ', 'légszenny...",['Kiss Imola'],,,A Nemzeti Népegészségügyi Központ (NNK) keddi ...,2021-10-26 07:01:59,A Nemzeti Népegészségügyi Központ (NNK) keddi ...,,444


#### Below, we remove non-English articles for efficient summarization.

In [9]:
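def detect_language(text):
    # langdetect raises an exception on text with no usable features
    # (e.g. empty strings or only digits/symbols); treat those as undetected
    try:
        return detect(text)
    except Exception:
        return None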

non_null_content['detected_language'] = non_null_content['content'].apply(detect_language)

english_articles = non_null_content[non_null_content['detected_language'] == 'en']

print(english_articles)

                                                   title  \
24            Napi trükkös matek feladat: Mi a megoldás?   
114    The best brown fashion pieces to get you throu...   
117    LOOK: Megan Thee Stallion’s college graduation...   
120    ‘Who has he fought?’ – Dillian Whyte slams Tys...   
131                        ADVISORY RUSSIA-EUROPE/NEWSER   
...                                                  ...   
86500  Jets vs. Patriots: Preview, predictions, what ...   
86517  Australian coalition govt junior partner gives...   
86539  Ikpeazu hails Abia Attorney General, Uche Ihed...   
86547  Rachel Riley warns friends not to do Strictly ...   
86551  Rachel Riley warns friends not to do Strictly ...   

                                                    link  \
24     https://keresztlabda.hu/2021/10/26/napi-trukko...   
114    https://metro.co.uk/2021/10/26/the-best-brown-...   
117    https://www.hitc.com/en-gb/2021/10/26/megan-th...   
120    https://www.thesun.co.uk/sport/1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  non_null_content['detected_language'] = non_null_content['content'].apply(detect_language)
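
The warning above appears because `non_null_content` was created by filtering `df`, so pandas cannot tell whether the new column is being written to a copy or to the original frame. A minimal sketch of one common way to avoid it is to take an explicit copy before adding the column. The `toy` frame and the `fake_detect` helper below are illustrative assumptions (stand-ins for the real CSV and for `langdetect.detect`), not part of the project:

```python
import pandas as pd

# Toy stand-in for the raw CSV (illustrative data only).
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "content": ["Hello world", None, "Bonjour le monde"],
})

def fake_detect(text):
    # Hypothetical stand-in for langdetect.detect so the sketch runs offline.
    return "en" if "Hello" in text else "fr"

# An explicit copy makes the filtered frame independent of `toy`,
# so adding a column to it is unambiguous and raises no warning.
non_null = toy[toy["content"].notna()].copy()
non_null["detected_language"] = non_null["content"].apply(fake_detect)

# Keep only the rows tagged as English.
english = non_null[non_null["detected_language"] == "en"]
print(english)
```

The same `.copy()` call could be appended to the NaN-filtering step in the notebook; the filtered result is unchanged, only the warning goes away.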


#### We check the mean word count of the articles that are in English (saved in english_articles) to make sure that they are of adequate length for summarization.

In [10]:
english_articles['content'].str.split().apply(len).mean()

366.4366197183099
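
The mean alone can hide very short articles. As a rough sketch, the same whitespace-based word count could also be used to drop articles below a minimum length. The `sample` frame and the `MIN_WORDS` threshold are hypothetical values chosen for illustration, not settings from the project:

```python
import pandas as pd

# Illustrative articles only; the real text lives in english_articles["content"].
sample = pd.DataFrame({
    "content": [
        "Short note.",
        "word " * 120,
        "another fairly long article " * 30,
    ]
})

MIN_WORDS = 50  # hypothetical threshold, not a value from the project

# Whitespace-split word count per article, as in the cell above.
word_counts = sample["content"].str.split().str.len()

# Keep only the articles that meet the minimum length.
long_enough = sample[word_counts >= MIN_WORDS]

print(word_counts.tolist())
print(len(long_enough))
```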

#### Below is the head of the cleaned data. The articles we summarize are taken from the "content" column.

In [16]:
head_english_words = english_articles.head()
head_english_words.iloc[1: , :]

Unnamed: 0,title,link,keywords,creator,video_url,description,content,pubDate,full_description,image_url,source_id,detected_language
114,The best brown fashion pieces to get you throu...,https://metro.co.uk/2021/10/26/the-best-brown-...,"['Fashion', 'Lifestyle', 'Shopping']",['Edaein O&#039;Connell'],,Choose sepia tones and never look back.,See the world in sepia (Picture: Weekday/NA-KD...,2021-10-26 06:53:26,When Adele released her new single ‘Easy On Me...,https://metro.co.uk/wp-content/uploads/2021/10...,metro,en
117,LOOK: Megan Thee Stallion’s college graduation...,https://www.hitc.com/en-gb/2021/10/26/megan-th...,"['Trending', 'college', 'graduation ceremony',...",['Disha Kandpal'],,Megan Thee Stallion is giving us all some much...,Megan Thee Stallion is giving us all some much...,2021-10-26 06:52:48,,,hitc,en
120,‘Who has he fought?’ – Dillian Whyte slams Tys...,https://www.thesun.co.uk/sport/16535318/dillia...,"['Boxing', 'Sport']",['Jack Figg'],,DILLIAN WHYTE has slammed claims Tyson Fury is...,DILLIAN WHYTE has slammed claims Tyson Fury is...,2021-10-26 06:52:14,,,thesun,en
131,ADVISORY RUSSIA-EUROPE/NEWSER,https://www.infobae.com/america/agencias/2021/...,,"['REUTERS, OCT 26']",,,Russia's Lavrov meets Norwegian and Finnish co...,2021-10-26 06:51:37,Russia's Lavrov meets Norwegian and Finnish co...,,infobae,en


#### All of the above steps are part of the data pre-processing, in which we cleaned the data and extracted the necessary rows into a new data frame. Below is the first draft of our pipeline. We do not have evaluation metrics set up yet, as we are still in the process of extracting summaries.

In [12]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

#### This portion of the code produces the first draft of the summaries from the english_articles text passed into the summarize function.

In [13]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

# Load the English tokenizer, tagger, parser and word vectors once,
# instead of reloading the model on every call to summarize()
nlp = spacy.load('en_core_web_sm')

def summarize(text, per):
    # "per" is the fraction of the article's sentences to keep in the summary
    # (e.g. 0.05 keeps roughly 5% of the sentences)

    # Process the text using the spaCy NLP pipeline
    doc = nlp(text)

    # Extracting and storing sentences from the document
    sentence_tokens = list(doc.sents)

    # Creating a dictionary to hold word frequencies
    word_frequencies = {}
    for word in doc:
        word_text = word.text.lower()  # Converting words to lowercase
        # Filtering out stopwords and punctuation
        if word_text not in STOP_WORDS and word_text not in punctuation:
            # Counting word frequencies
            word_frequencies[word_text] = word_frequencies.get(word_text, 0) + 1

    # Nothing to score (empty text, or only stopwords/punctuation):
    # return early, since max() would fail on an empty dictionary
    if not word_frequencies:
        return ''

    # Finding the maximum word frequency
    max_frequency = max(word_frequencies.values())
    # Normalizing word frequencies so the most frequent word has weight 1.0
    for word in word_frequencies:
        word_frequencies[word] /= max_frequency

    # Scoring sentences based on the normalized frequencies of the words they contain
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            word_text = word.text.lower()
            if word_text in word_frequencies:
                # Adding the word frequencies to find the sentence score
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word_text]

    # Determining the number of sentences to include in the summary;
    # keep at least one so short articles do not produce an empty summary
    select_length = max(1, int(len(sentence_tokens) * per))
    # Selecting the top sentences based on score (highest score first)
    summary_sentences = nlargest(select_length, sentence_scores, key=sentence_scores.get)

    # Joining the selected sentences to create the summary
    final_summary = ' '.join(sent.text for sent in summary_sentences)

    return final_summary  # Returning the summary
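
To make the scoring concrete, here is a small worked sketch of the same idea (count content words, normalize by the most frequent one, score each sentence by the sum of its word weights, keep the top fraction) in plain Python. It is not the spaCy pipeline above: the regex sentence splitter, the tiny `STOP` set and the `toy_summarize` name are simplifying assumptions made so the example runs without a language model.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; spaCy's STOP_WORDS is far larger).
STOP = {"the", "a", "is", "on", "and", "it", "was", "of", "in", "to"}

def content_words(sentence):
    # Lowercased alphabetic tokens with stop words removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP]

def toy_summarize(text, per):
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies over the whole text.
    freqs = Counter(w for s in sentences for w in content_words(s))
    if not freqs:
        return ""
    top = max(freqs.values())

    # Sentence score = sum of normalized frequencies of its content words.
    scores = {
        i: sum(freqs[w] / top for w in content_words(s))
        for i, s in enumerate(sentences)
    }

    # Keep the requested fraction of sentences, but at least one.
    k = max(1, int(len(sentences) * per))
    best = nlargest(k, scores, key=scores.get)  # indices, highest score first
    return " ".join(sentences[i] for i in best)

text = ("The cat sat on the mat. The cat chased the mouse. "
        "It was sunny. The mouse hid from the cat in the house.")
print(toy_summarize(text, 0.5))
```

Here "cat" is the most frequent content word (weight 1.0) and "mouse" the second (weight 2/3), so the two sentences that mention both score highest. Note that the selected sentences come back in score order rather than in their original order; replacing `best` with `sorted(best)` before joining would restore document order if that reads better.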


In [14]:
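# Select the article by column name rather than by column position
article = english_articles['content'].iloc[2]
print(summarize(article, 0.05))  # summary using 5% of the article's sentences
print(article)                   # original article text, for comparison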

October 19, 2021 something about megan thee stallion preparing for her college graduation while being one of the most in demand rappers/celebrities rn….. love to see it— lala ‍ 13 (@lalaloveontour) October 25, 2021 does megan thee stallion attend college physically like can she even do that loll shes soo popular ppl would bother her all the time— LOGAN ROY (@ripofffsasuke) October 25, 2021 not megan thee stallion graduating from college on the same exact day as me!!
Megan Thee Stallion is giving us all some much-needed inspiration as the Grammy winner graduated from college over the past weekend. The Texas native took some dank pictures from her commencement ceremony, to which she wore a stunning, bedazzled ‘real hot girl sh*t’ cap. While Megan looked stunning, the cap certainly became a show-stealer, while giving a not-so-subtle nod to her 2019 hit song Hot Girl Summer. SEE: Meet Micah Beals, actor who allegedly vandalized George Floyd’s statues View this post on Instagram A post shar