## 410 Final Project: Generating Summaries for News Articles
Aaron Kuhstoss, Shalin Mehta, and Aleksandra Grigortsuk

### Imports

In [49]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from langdetect import detect

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

### Data Preprocessing

In [2]:
# import the dataset
df = pd.read_csv("Latest_News.csv")

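# Keep only English-language articles. Rows where detection raises
# (for example missing or undetectable text) get None and are filtered out below.
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return None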

# Apply the language detection function to df
df['detected_language'] = df['content'].apply(detect_language)
english_articles = df[df['detected_language'] == 'en']
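
The filter above drops rows with missing text only implicitly, because detection fails on them. A minimal sketch that makes that check explicit is shown below. The helper name `filter_english`, its `detect_fn` parameter, and the stub detector are hypothetical and used only for illustration; in the notebook, `detect_fn` would be `detect_language`.

```python
import pandas as pd

def filter_english(df, detect_fn, text_col="content"):
    """Keep rows whose text is present, non-blank, and detected as English."""
    # Drop missing or whitespace-only text before running the detector
    has_text = df[text_col].notna() & (df[text_col].astype(str).str.strip() != "")
    kept = df[has_text].copy()
    kept["detected_language"] = kept[text_col].apply(detect_fn)
    return kept[kept["detected_language"] == "en"]

# Hypothetical stand-in for a real language detector, for demonstration only
def stub_detect(text):
    return "en" if "Hello" in text else "fr"

demo = pd.DataFrame({"content": ["Hello world", None, "   ", "Bonjour le monde"]})
english_demo = filter_english(demo, stub_detect)
print(english_demo)
```

Passing the detector in as an argument keeps the filtering logic testable without depending on any particular detection library.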




In [30]:
print(len(english_articles))
print(english_articles['content'].str.len().mean())
print(english_articles['description'].str.len().mean())

print(english_articles['content'].isnull().sum())
print(english_articles['title'].isnull().sum())

# Every article has content and a title.
# There are 6741 articles.
# The average content length is about 2,200 characters.

english_articles.head()

6741
2236.156653315532
237.57382497230574
0
0


Unnamed: 0,title,link,keywords,creator,video_url,description,content,pubDate,full_description,image_url,source_id,detected_language,d_language
24,Napi trükkös matek feladat: Mi a megoldás?,https://keresztlabda.hu/2021/10/26/napi-trukko...,"['Fejtörő', 'Matek', 'Napi Feladat', 'feladat'...",['Adam'],,"Nagyon sok fajta kvízünk, vagy épp feladatunk ...","Nagyon sok fajta kvízünk, vagy épp feladatunk ...",2021-10-26 07:00:57,"Nagyon sok fajta kvízünk , vagy épp feladatunk...",,keresztlabda,en,en
114,The best brown fashion pieces to get you throu...,https://metro.co.uk/2021/10/26/the-best-brown-...,"['Fashion', 'Lifestyle', 'Shopping']",['Edaein O&#039;Connell'],,Choose sepia tones and never look back.,See the world in sepia (Picture: Weekday/NA-KD...,2021-10-26 06:53:26,When Adele released her new single ‘Easy On Me...,https://metro.co.uk/wp-content/uploads/2021/10...,metro,en,en
117,LOOK: Megan Thee Stallion’s college graduation...,https://www.hitc.com/en-gb/2021/10/26/megan-th...,"['Trending', 'college', 'graduation ceremony',...",['Disha Kandpal'],,Megan Thee Stallion is giving us all some much...,Megan Thee Stallion is giving us all some much...,2021-10-26 06:52:48,,,hitc,en,en
120,‘Who has he fought?’ – Dillian Whyte slams Tys...,https://www.thesun.co.uk/sport/16535318/dillia...,"['Boxing', 'Sport']",['Jack Figg'],,DILLIAN WHYTE has slammed claims Tyson Fury is...,DILLIAN WHYTE has slammed claims Tyson Fury is...,2021-10-26 06:52:14,,,thesun,en,en
131,ADVISORY RUSSIA-EUROPE/NEWSER,https://www.infobae.com/america/agencias/2021/...,,"['REUTERS, OCT 26']",,,Russia's Lavrov meets Norwegian and Finnish co...,2021-10-26 06:51:37,Russia's Lavrov meets Norwegian and Finnish co...,,infobae,en,en


### Pipeline Construction
1. Summarization pipeline
2. Categorization pipeline (optional)
*does not need to be fully implemented by 11/1 milestone*

In [37]:
# Load the spaCy model once; reloading it on every call is slow
nlp = spacy.load('en_core_web_sm')

def summarize(text, per):
    """Return the top `per` fraction of sentences, scored by normalized word frequency."""
    doc = nlp(text)
    # Count lowercased words, skipping stop words and punctuation.
    # Keys are lowercased so they match the lowercase lookups during scoring.
    word_frequencies = {}
    for word in doc:
        key = word.text.lower()
        if key not in STOP_WORDS and key not in punctuation:
            word_frequencies[key] = word_frequencies.get(key, 0) + 1
    if not word_frequencies:
        return ''
    # Normalize counts by the most frequent word
    max_frequency = max(word_frequencies.values())
    for key in word_frequencies:
        word_frequencies[key] /= max_frequency
    # Score each sentence as the sum of its words' normalized frequencies
    sentence_tokens = list(doc.sents)
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            key = word.text.lower()
            if key in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]
    # Keep the highest-scoring sentences
    select_length = int(len(sentence_tokens) * per)
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    # Join with spaces so sentences do not run together
    return ' '.join(sent.text for sent in summary)
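
The scoring idea behind `summarize` can be tried without spaCy. The sketch below is a simplified stand-in, not the notebook's pipeline: `toy_summarize` and `TOY_STOP_WORDS` are hypothetical names, sentences are split with a naive regular expression instead of spaCy's sentence segmenter, and the stop-word list is a tiny illustrative set. It also makes two choices that differ from `summarize`: it always keeps at least one sentence, and it returns the selected sentences in their original order.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; the notebook uses spaCy's STOP_WORDS)
TOY_STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "on"}

def toy_summarize(text, per):
    """Frequency-based extractive summary using only the standard library."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    # Count every non-stop word across the whole text
    counts = Counter(w for words in tokenized for w in words if w not in TOY_STOP_WORDS)
    if not counts:
        return ""
    max_count = max(counts.values())
    # Sentence score = sum of its words' frequencies, normalized by the top count
    scores = [sum(counts[w] / max_count for w in words if w in counts)
              for words in tokenized]
    # Keep at least one sentence, then restore document order
    k = max(1, int(len(sentences) * per))
    top = nlargest(k, range(len(sentences)), key=scores.__getitem__)
    return " ".join(sentences[i] for i in sorted(top))

sample = "Cats chase mice. Dogs chase cats and mice. Birds sing."
print(toy_summarize(sample, 0.34))
```

Because every score is a plain sum, longer sentences tend to score higher; dividing by sentence length would be one way to counter that, at the cost of favoring very short sentences.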

In [54]:
print(summarize(english_articles.iloc[2,6], 0.05))

print(english_articles.iloc[2,6])

October 19, 2021 something about megan thee stallion preparing for her college graduation while being one of the most in demand rappers/celebrities rn….. love to see it— lala ‍ 13 (@lalaloveontour) October 25, 2021 does megan thee stallion attend college physically like can she even do that loll shes soo popular ppl would bother her all the time— LOGAN ROY (@ripofffsasuke) October 25, 2021 not megan thee stallion graduating from college on the same exact day as me!!
Megan Thee Stallion is giving us all some much-needed inspiration as the Grammy winner graduated from college over the past weekend. The Texas native took some dank pictures from her commencement ceremony, to which she wore a stunning, bedazzled ‘real hot girl sh*t’ cap. While Megan looked stunning, the cap certainly became a show-stealer, while giving a not-so-subtle nod to her 2019 hit song Hot Girl Summer. SEE: Meet Micah Beals, actor who allegedly vandalized George Floyd’s statues View this post on Instagram A post shar