# Text Summarisation of Articles
## Text Summarization
Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.

Automatic data summarization is part of machine learning and data mining. The primary idea of summarization is to find a subset of data which contains the "information" of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.

There are two general approaches to automatic summarization: Extraction and Abstraction.

#### 1. Extractive Summarization: 
These methods extract several parts, such as phrases and sentences, from a piece of text and stack them together to create a summary. Identifying the right sentences is therefore of utmost importance in an extractive method.
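
As a minimal sketch of the selection step, assume every sentence already has an importance score. The helper name `select_top_sentences` and the scores below are hypothetical and only illustrate the idea. Unlike the function later in this notebook, this sketch returns the chosen sentences in their original document order.

```python
def select_top_sentences(sentences, scores, n):
    """Return the n highest-scoring sentences, kept in document order."""
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top n indices, then restore the original order.
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]


sentences = ["A is used for acne.", "Wash the area first.", "Ask your doctor."]
scores = [0.5, 0.2, 0.3]  # illustrative values only, not computed by any model
summary = select_top_sentences(sentences, scores, 2)
print(" ".join(summary))
```

Restoring document order is a design choice: it tends to keep the summary readable, at the cost of hiding which sentence ranked highest.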


#### 2. Abstractive Summarization: 
These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text. Such a summary might include verbal innovations. Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization.

This Jupyter notebook implements the TextRank algorithm for extractive text summarization. TextRank builds a graph whose nodes are sentences and whose edge weights are sentence similarities, then applies Google's PageRank algorithm to rank the sentences.
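
The ranking idea can be sketched with a plain power iteration over a small similarity matrix. This is an illustrative sketch, not the code used below (the notebook relies on `networkx.pagerank`). The function name `pagerank`, the damping value 0.85, the fixed iteration count, and the matrix values are all assumptions made for the example. It also assumes every sentence has at least one non-zero similarity.

```python
def pagerank(sim, damping=0.85, iterations=100):
    """Power-iteration PageRank over a similarity matrix given as a list of lists."""
    n = len(sim)
    # Total outgoing weight of each node, used to normalise its edge weights.
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Weighted share of score flowing from each neighbour j into node i.
            incoming = sum(
                sim[j][i] / row_sums[j] * scores[j]
                for j in range(n)
                if row_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Hand-written similarities between three sentences (made-up values).
sim = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
scores = pagerank(sim)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Sentence 0 is strongly similar to both other sentences, so it collects the largest share of the score.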

#### Libraries Used
- NumPy
- pandas
- Natural Language Toolkit (NLTK)
- scikit-learn
- NetworkX

####  Algorithms and Concepts
- TextRank
- PageRank
- Cosine Similarity
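
Cosine similarity measures the angle between two vectors: 1 means the same direction and 0 means orthogonal. The sketch below uses only the standard library so it can be read on its own. The notebook itself calls `cosine_similarity` from scikit-learn. The function name `cosine` and the choice to return 0.0 for a zero vector are assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat its similarity as 0.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```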

### How to run
- Install the required libraries using pip or conda, ideally inside a virtual environment.
- Run `jupyter notebook` in your terminal and open this notebook.

In [53]:
# Importing the required Libraries
import numpy as np
import pandas as pd
import nltk
# nltk.download('punkt') # one time execution
import re
#nltk.download('stopwords') # one time execution
import matplotlib.pyplot as plt

from nltk.tokenize import sent_tokenize

from nltk.corpus import stopwords

from sklearn.metrics.pairwise import cosine_similarity

import networkx as nx

### Word embedding
Word embedding is a technique that converts each word into an equivalent vector of floating-point numbers. Various techniques exist, depending on the use case of the model and the dataset.
Example: 

'the': [-0.123, 0.353, 0.652, -0.232]


'the' is a very frequently used word in texts of any kind. An illustrative 4-dimensional dense vector for it is given above.
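
One simple way to turn word vectors into a sentence vector is to average them. The sketch below assumes a tiny hand-made lookup table. `toy_embeddings` and `sentence_vector` are hypothetical names, and every vector except the one for 'the' is made up. Words missing from the table contribute a zero vector.

```python
DIM = 4

toy_embeddings = {
    "the": [-0.123, 0.353, 0.652, -0.232],  # vector from the example above
    "gel": [0.2, 0.0, -0.4, 0.6],           # made-up values
    "treats": [0.1, 0.3, 0.2, -0.1],        # made-up values
}


def sentence_vector(sentence, embeddings, dim=DIM):
    """Average the vectors of all words in the sentence."""
    words = sentence.lower().split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        # Unknown words fall back to a zero vector.
        vec = embeddings.get(word, [0.0] * dim)
        total = [t + x for t, x in zip(total, vec)]
    return [t / len(words) for t in total]


vec = sentence_vector("The gel treats acne", toy_embeddings)
print(vec)
```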


### Pretrained Word Embeddings for NLP
Pretrained Word Embeddings are the embeddings learned in one task that are used for solving another similar task.

These embeddings are trained on large datasets, saved, and then used for solving other tasks. That’s why pretrained word embeddings are a form of Transfer Learning.
#### Google’s Word2vec 
- Word2Vec is trained on the Google News dataset (about 100 billion words).
#### Stanford’s GloVe
- GloVe stands for Global Vectors and was created at Stanford University. The `glove.6B` files used here were trained on a corpus of about 6 billion tokens and provide pre-defined dense vectors for a vocabulary of 400,000 entries, including general-use characters such as commas, braces, and semicolons.
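
GloVe text files store one entry per line: the token followed by its vector components, separated by spaces. The sketch below parses that layout from an in-memory sample so it runs without the real file. The sample values are made up and `load_vectors` is a hypothetical helper name.

```python
import io

# Two made-up lines in the same layout as a GloVe text file (4 dimensions here).
sample = io.StringIO(
    "the -0.123 0.353 0.652 -0.232\n"
    ", 0.11 -0.27 0.05 0.40\n"
)


def load_vectors(handle):
    """Map each token to its list of float components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


vectors = load_vectors(sample)
print(len(vectors), vectors["the"])
```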

In [54]:
# Extract word vectors
word_embeddings = {}

# Each line holds a token followed by the components of its vector
with open(r'C:\Users\user\NLP\glove.6B.100d.txt', encoding='utf-8', errors='ignore') as file:
    for line in file:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
len(word_embeddings)

400000

#### In this task, I will summarise the medicine descriptions provided in the file named 'TASK.xlsx'

In [55]:
# reading the file
df = pd.read_excel(r'C:\Users\user\NLP\TASK.xlsx')

In [56]:
df

Unnamed: 0,TEST DATASET,Unnamed: 1
0,,Introduction
1,,Acnesol Gel is an antibiotic that fights bacte...
2,,Ambrodil Syrup is used for treating various re...
3,,Augmentin 625 Duo Tablet is a penicillin-type ...
4,,Azithral 500 Tablet is an antibiotic used to t...
...,...,...
996,,Azapure Tablet belongs to a group of medicines...
997,,Arimidex 1mg Tablet is used alone or with oth...
998,,Arpimune ME 100mg Capsule is used to prevent y...
999,,Amlodac CH Tablet is a combination medicine us...


In [57]:
df.columns

Index(['TEST DATASET', 'Unnamed: 1'], dtype='object')

In [58]:
df.rename(columns = {'Unnamed: 1' : 'Introduction' }, inplace=True)
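# Display the data without the first row (the header text).
# drop() returns a new DataFrame, so df itself still contains row 0.
df.drop(0)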

Unnamed: 0,TEST DATASET,Introduction
1,,Acnesol Gel is an antibiotic that fights bacte...
2,,Ambrodil Syrup is used for treating various re...
3,,Augmentin 625 Duo Tablet is a penicillin-type ...
4,,Azithral 500 Tablet is an antibiotic used to t...
5,,Alkasol Oral Solution is a medicine used in th...
...,...,...
996,,Azapure Tablet belongs to a group of medicines...
997,,Arimidex 1mg Tablet is used alone or with oth...
998,,Arpimune ME 100mg Capsule is used to prevent y...
999,,Amlodac CH Tablet is a combination medicine us...


In [73]:
# Collect the descriptions into a dictionary keyed by row label.
# Row 0 holds the header text 'Introduction', so start from 1.
text_dictionary = {}
for i in range(1, len(df)):
    text_dictionary[i] = df['Introduction'][i]
    
print(text_dictionary[1])

Acnesol Gel is an antibiotic that fights bacteria. It is used to treat acne, which appears as spots or pimples on your face, chest or back. This medicine works by attacking the bacteria that cause these pimples.Acnesol Gel is only meant for external use and should be used as advised by your doctor. You should normally wash and dry the affected area before applying a thin layer of the medicine. It should not be applied to broken or damaged skin. Avoid any contact with your eyes, nose, or mouth. Rinse it off with water if you accidentally get it in these areas. It may take several weeks for your symptoms to improve, but you should keep using this medicine regularly. Do not stop using it as soon as your acne starts to get better. Ask your doctor when you should stop treatment.Common side effects like minor itching, burning, or redness of the skin and oily skin may be seen in some people. These are usually temporary and resolve on their own. Consult your doctor if they bother you or do not

#### There are about 1000 such descriptions of different medicines. The task is to produce a summarised form of each description.
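
The sample description above contains sentence boundaries with no space after the full stop (for example `pimples.Acnesol`), which a sentence tokenizer may not split. A possible pre-processing step, shown here only as a sketch, inserts the missing space. `add_missing_sentence_spaces` is a hypothetical helper, and the rule (lowercase letter, full stop, uppercase letter) is a heuristic that could misfire on some abbreviations or product names.

```python
import re


def add_missing_sentence_spaces(text):
    """Insert a space after a full stop that sits between a lowercase and an uppercase letter."""
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)


raw = (
    "This medicine works by attacking the bacteria that cause these pimples."
    "Acnesol Gel is only meant for external use."
)
fixed = add_missing_sentence_spaces(raw)
print(fixed)
```

Decimal numbers such as `2.5` are left untouched because the rule requires letters on both sides of the full stop.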

In [60]:
# function to remove stopwords from a tokenised sentence (a list of words)
stop_words = set(stopwords.words('english'))  # loaded once, as a set for fast lookup

def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [61]:
# function to make vectors out of the sentences
def sentence_vector_func (sentences_cleaned) : 
    sentence_vector = []
    for i in sentences_cleaned:
        if len(i) != 0:
            v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
        else:
            v = np.zeros((100,))
        sentence_vector.append(v)
    
    return (sentence_vector)

In [68]:
# function to get the summary of the articles
# NOTE - Uncomment the print statements to display the contents at different stages of the text summarisation process
def summary_text (test_text, n = 5):
    sentences = []
    
    # tokenising the text 
    sentences.append(sent_tokenize(test_text))
    #print(sentences)
    sentences = [y for x in sentences for y in x] # flatten list
    #print(sentences)
    
    # remove punctuation and special characters (letters, digits and spaces are kept)
    clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z0-9 ]", " ", regex=True)

    # make alphabets lowercase
    clean_sentences = [s.lower() for s in clean_sentences]
    #print(clean_sentences)

    
    # remove stopwords from the sentences
    clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
    #print(clean_sentences)
    
    sentence_vectors = sentence_vector_func(clean_sentences)
    
    # similarity matrix
    sim_mat = np.zeros([len(sentences), len(sentences)])
    #print(sim_mat)
    
    # Finding the similarities between the sentences 
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
    
    
    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)
    #print(scores)
    
    # rank sentences from highest to lowest PageRank score
    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
    # Extract the top n sentences as the summary
    summarised_string = ''
    for i in range(n):
        
        try:
            summarised_string = summarised_string + str(ranked_sentences[i][1]) + ' '
        except IndexError:
            # the text has fewer than n sentences
            print ("Summary Not Available")
    
    return (summarised_string.strip())


In [None]:
print("Kindly let me know in how many sentences you want the summary - ")
x = int(input())

summary_dictionary = {}

for key in text_dictionary:
    
    para = text_dictionary[key]
    print("Summary of the article - ",key)
    summary = summary_text(para,x)
    summary_dictionary[key] = summary
    
    print(summary)
    print('='*120)    
    
print ("*"*40,"The process has been completed successfully","*"*40)

Kindly let me know in how many sentences you want the summary - 
