# 1. DATASET LINK: https://www.kaggle.com/datasets/sunnysai12345/news-summary

# 2. Dataset Brief Description:  
The first dataset (`news_summary.csv`) consists of 4514 rows as loaded in this notebook and contains the author's name, publication date, headline, article URL, short text (summary), and complete article. According to the dataset author, the summarized news was gathered from Inshorts, and the full articles were scraped only from the Hindu, Indian Times, and Guardian. The period ranges from February to August 2017.

The second dataset (`news_summary_more.csv`) consists of 98401 rows as loaded in this notebook, with 2 columns labeled `headlines` and `text`.


Because the task is text summarization of news articles, we merged the two datasets before preprocessing and cleaning so that the model has more text to learn from. The merged dataset contains 102915 rows (4514 + 98401) and 2 columns labeled `text` and `summary`. The `text` column has some null values.


# 3. Data Dictionary:
Raw Datasets:

Dataset 1:  Name: news_summary.csv


| Column Name | Data Type    | Description                            |
|-------------|--------------|----------------------------------------|
| author      | String/object| Name of the news author                |
| date        | String/object| Date of publication                    |
| headlines   | String/object| Headline of the news article           |
| read_more   | String/link  | Link to the full article               |
| text        | String/object| Short text (summary) of the article    |
| ctext       | String/object| Complete text of the article           |






Dataset 2:  Name: news_summary_more.csv

| Column Name | Data Type    | Description                              |
|-------------|--------------|------------------------------------------|
| headlines   | String/object| Contains the headline of the news.       |
| text        | String/object| Contains the content of the news.        |





Merged Dataset:
We merged the two raw datasets into our preferred dataset, which we call “Tuned News Summary”.



| Column Name | Data Type    | Description                                                             |
|-------------|--------------|-------------------------------------------------------------------------|
| text        | String/object| Contains the long text or description of the article. Used for training.|
| summary     | String/object| Contains the summary of the particular article. Predicted by the model. |






# MOUNT DRIVE

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 4. LOAD DATASETS

In [3]:
import pandas as pd

summary = pd.read_csv('/content/drive/MyDrive/CSE 475 TEXT SUMMARIZATION/DATASET/news_summary.csv',
                      encoding='iso-8859-1')
raw = pd.read_csv('/content/drive/MyDrive/CSE 475 TEXT SUMMARIZATION/DATASET/news_summary_more.csv',
                  encoding='iso-8859-1')
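
Both files are read with `encoding='iso-8859-1'`. The following is a minimal sketch of why that argument can matter. It uses a made-up byte string, not the real files: bytes that decode cleanly as ISO-8859-1 are not necessarily valid UTF-8, which is the default encoding for `pd.read_csv`.

```python
# Hypothetical bytes for illustration only; not taken from the dataset
sample_bytes = b"caf\xe9 news"

# ISO-8859-1 maps every byte to a character, so decoding always succeeds
decoded_latin1 = sample_bytes.decode("iso-8859-1")

# The same bytes are not valid UTF-8, so a strict decode raises an error
try:
    sample_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded_latin1, utf8_ok)
```

Because ISO-8859-1 never fails to decode, it can also silently produce wrong characters if a file was actually written in another encoding, so spot-checking a few rows after loading is a reasonable precaution.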

# 5. Properties Of Datasets

In [9]:
summary.head()

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In [10]:
raw.head()

Unnamed: 0,headlines,text
0,upGrad learner switches to career in ML & Al w...,"Saurav Kant, an alumnus of upGrad and IIIT-B's..."
1,Delhi techie wins free food from Swiggy for on...,Kunal Shah's credit card bill payment platform...
2,New Zealand end Rohit Sharma-led India's 12-ma...,New Zealand defeated India by 8 wickets in the...
3,Aegon life iTerm insurance plan helps customer...,"With Aegon Life iTerm Insurance plan, customer..."
4,"Have known Hirani for yrs, what if MeToo claim...",Speaking about the sexual harassment allegatio...


In [15]:
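import os

# Same file paths as in the load step
summary_path = '/content/drive/MyDrive/CSE 475 TEXT SUMMARIZATION/DATASET/news_summary.csv'
raw_path = '/content/drive/MyDrive/CSE 475 TEXT SUMMARIZATION/DATASET/news_summary_more.csv'

# File sizes on disk in MB (assumption: the original size computation is not shown)
summary_size = os.path.getsize(summary_path) / (1024 * 1024)
raw_size = os.path.getsize(raw_path) / (1024 * 1024)

# Get the number of rows and columns
summary_shape = summary.shape
raw_shape = raw.shape

# Get the data types
summary_dtypes = summary.dtypes
raw_dtypes = raw.dtypes

# Print the results
print(f"Summary dataset: Size = {summary_size} MB, Shape = {summary_shape}, Data types = {summary_dtypes}")
print(f"Raw dataset: Size = {raw_size} MB, Shape = {raw_shape}, Data types = {raw_dtypes}")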

Summary dataset: Size = 11.345305442810059 MB, Shape = (4514, 6), Data types = author       object
date         object
headlines    object
read_more    object
text         object
ctext        object
dtype: object
Raw dataset: Size = 39.48142051696777 MB, Shape = (98401, 2), Data types = headlines    object
text         object
dtype: object


In [11]:
pre1 = raw.iloc[:, 0:2].copy()
pre2 = summary.iloc[:, 0:6].copy()

# Join author, date, read_more, text and ctext into a single 'text' field
# so that each row gives the model more text to work with
pre2['text'] = pre2['author'].str.cat(
    [pre2['date'], pre2['read_more'], pre2['text'], pre2['ctext']], sep=' ')

pre = pd.DataFrame()
pre['text'] = pd.concat([pre1['text'], pre2['text']], ignore_index=True)
pre['summary'] = pd.concat([pre1['headlines'], pre2['headlines']],
                           ignore_index=True)
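
The dataset description notes that the merged `text` column has some null values. One plausible cause (an assumption, not verified against the data) is that `Series.str.cat` returns a missing value for a row when any of the joined columns is missing in that row, unless `na_rep` is given. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Toy frame standing in for news_summary.csv; all values are made up
toy = pd.DataFrame({
    "author": ["A. Writer", None],
    "text": ["Short summary.", "Another summary."],
    "ctext": ["Full article body.", "Second article body."],
})

# Default behaviour: one missing piece makes the whole joined value missing
joined_default = toy["author"].str.cat([toy["text"], toy["ctext"]], sep=" ")

# With na_rep, missing pieces are replaced by the given string before joining
joined_filled = toy["author"].str.cat(
    [toy["text"], toy["ctext"]], sep=" ", na_rep="")

print("Nulls without na_rep:", joined_default.isna().sum())
print("Nulls with na_rep='':", joined_filled.isna().sum())
```

Whether to fill missing pieces or drop the affected rows is a design choice for the cleaning step. Running `pre['text'].isna().sum()` on the merged frame would show how many rows are affected.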

In [18]:
pre.head(10)

Unnamed: 0,text,summary
0,"Saurav Kant, an alumnus of upGrad and IIIT-B's...",upGrad learner switches to career in ML & Al w...
1,Kunal Shah's credit card bill payment platform...,Delhi techie wins free food from Swiggy for on...
2,New Zealand defeated India by 8 wickets in the...,New Zealand end Rohit Sharma-led India's 12-ma...
3,"With Aegon Life iTerm Insurance plan, customer...",Aegon life iTerm insurance plan helps customer...
4,Speaking about the sexual harassment allegatio...,"Have known Hirani for yrs, what if MeToo claim..."
5,Pakistani singer Rahat Fateh Ali Khan has deni...,Rahat Fateh Ali Khan denies getting notice for...
6,India recorded their lowest ODI total in New Z...,"India get all out for 92, their lowest ODI tot..."
7,Weeks after ex-CBI Director Alok Verma told th...,Govt directs Alok Verma to join work 1 day bef...
8,Andhra Pradesh CM N Chandrababu Naidu has said...,Called PM Modi 'sir' 10 times to satisfy his e...
9,Congress candidate Shafia Zubair won the Ramga...,"Cong wins Ramgarh bypoll in Rajasthan, takes t..."


In [16]:
pre_size_mb = pre.memory_usage().sum() / (1024 * 1024)
pre_shape = pre.shape
pre_datatypes = pre.dtypes

# Print the results
print(f"Preprocessed dataset: Size = {pre_size_mb:.2f} MB, Shape = {pre_shape}, Data types = {pre_datatypes}")


Preprocessed dataset: Size = 1.57 MB, Shape = (102915, 2), Data types = text       object
summary    object
dtype: object
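
The reported 1.57 MB is consistent with 102915 rows × 2 columns × 8 bytes, which suggests that `memory_usage()` counted only the object pointers and not the strings themselves. A minimal sketch of the difference, using a made-up toy frame with explicit `object` dtype; passing `deep=True` asks pandas to include the size of the Python string objects:

```python
import pandas as pd

# Toy frame with object-dtype string columns; all values are made up
toy = pd.DataFrame({
    "text": ["a long article body " * 50] * 1000,
    "summary": ["a headline"] * 1000,
}, dtype=object)

# Shallow: counts the 8-byte references held in each object column
shallow_bytes = toy.memory_usage().sum()

# Deep: also counts the memory of the string objects being referenced
deep_bytes = toy.memory_usage(deep=True).sum()

print(f"shallow = {shallow_bytes / (1024 * 1024):.2f} MB")
print(f"deep    = {deep_bytes / (1024 * 1024):.2f} MB")
```

Applied to the merged frame, `pre.memory_usage(deep=True).sum()` would likely give a figure closer to the combined file sizes than 1.57 MB.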


# 6. Variables Intro (Text to Definition)

In [33]:

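import numpy as np

def count_words(value):
    # Assumed helper (its definition is not shown in this notebook):
    # counts whitespace-separated tokens
    return len(str(value).split())

# Set random seed for reproducibility
np.random.seed(50)

# Sample 200 random rows from the DataFrame
random_sample = pre.sample(n=200, random_state=50)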

# Apply the function to the 'text' column for the random sample
total_words_text_random_sample = random_sample['text'].fillna('').apply(count_words).sum()

# Calculate the average number of words per row in the 'text' column
average_words_text_per_row = total_words_text_random_sample / 200

# Apply the function to the 'summary' column for the random sample
total_words_summary_random_sample = random_sample['summary'].fillna('').apply(count_words).sum()

# Calculate the average number of words per row in the 'summary' column
average_words_summary_per_row = total_words_summary_random_sample / 200

# Display the results
#print("For 200 random rows:")
#print("Total number of words in the 'text' column:", total_words_text_random_sample)
print("Average number of words per row in the 'text' column:", average_words_text_per_row)
#print("Total number of words in the 'summary' column:", total_words_summary_random_sample)
print("Average number of words per row in the 'summary' column:", average_words_summary_per_row)


Average number of words per row in the 'text' column: 76.27
Average number of words per row in the 'summary' column: 9.53
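
The averages above come from a 200-row sample. As a hedged alternative, the same per-row word counts can be computed for every row with pandas string methods, which avoids a Python-level helper. The sketch below uses a made-up toy frame and assumes that "words" means whitespace-separated tokens:

```python
import pandas as pd

# Toy frame standing in for the merged dataset; all values are made up
toy = pd.DataFrame({
    "text": ["India won the match by eight wickets",
             None,
             "Markets closed higher on Friday"],
    "summary": ["India win", "No text available", "Markets close higher"],
})

# Split on whitespace and count tokens; missing values count as zero words
text_words = toy["text"].fillna("").str.split().str.len()
summary_words = toy["summary"].fillna("").str.split().str.len()

print("Average words per row in 'text':", text_words.mean())
print("Average words per row in 'summary':", summary_words.mean())
```

Keeping the per-row counts as a Series also makes it straightforward to look at the distribution with `describe()`, for example when choosing maximum input and output lengths for a model.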
