# Text summarization using machine learning techniques

The purpose of this notebook is to create a set of functions implementing techniques frequently used in NLP tasks to clean and prepare text data. Given a data file (CSV or Excel) containing our source texts and summaries, we will transform them into text data well suited to training our text summarization models.


### Importing the libraries

In [1]:
# Import utils and text processing libraries
import pandas as pd
import re
import string
import os, shutil
import unicodedata

# Import library to split our dataset
from sklearn.model_selection import train_test_split

In [3]:
# Check the working dir
#!pwd

We are going to remove punctuation, but we will keep the period ("."), which is needed as a sentence delimiter.

In [2]:
punctuation = string.punctuation.replace('.', '')
print(punctuation)

!"#$%&'()*+,-/:;<=>?@[\]^_`{|}~
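
As a quick illustration of why the period is kept, the sketch below strips every other punctuation mark with `str.translate` and then splits the result into sentences on ".". The sample string and the naive split are assumptions made for demonstration only; they are not part of the notebook's pipeline.

```python
import string

# Every ASCII punctuation mark except the period
punctuation = string.punctuation.replace('.', '')

sample = "Hello, world! This is a test. Is it clear?"  # hypothetical sample text
table = str.maketrans('', '', punctuation)
no_punct = sample.translate(table)

# Naive sentence split on the period we kept
sentences = [s.strip() for s in no_punct.split('.') if s.strip()]
print(no_punct)
print(sentences)
```

Note that "!" and "?" are removed here, so sentences ending in those marks are merged with the next one.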


In [3]:
# Set the data path and filename of our dataset
DATADIR='data/'
data_filename= 'Inshorts Cleaned Data.xlsx'
datafile_type='xls'
# Depending on the datafile we set the names of the columns to use
cols_to_use = ['Headline','Short']
# Set the environment variable DATA if we need to copy the file from GCS
os.environ['DATA'] = DATADIR

**We must run the next cell the very first time or when the datafile has been modified**

In [None]:
# Only when we need to create or update the dataset file
# Create an environment variable with the data path and recreate the data directory
#shutil.rmtree(DATADIR, ignore_errors=True)
#os.makedirs(DATADIR)

In [55]:
%%bash
gsutil cp gs://mlend_text_summarization/data/news_summary/news_summary.csv ${DATA}

Copying gs://mlend_text_summarization/data/news_summary/news_summary.csv...
/ [1 files][ 11.4 MiB/ 11.4 MiB]                                                
Operation completed over 1 objects/11.4 MiB.                                     


### Loading the dataset with the examples

In [48]:
# Read the data file into a dataframe
if datafile_type == 'csv':
    data = pd.read_csv(DATADIR + data_filename, encoding='utf-8')
elif datafile_type == 'xls':
    # Load the data from the file system (read_excel takes no encoding argument in recent pandas)
    data = pd.read_excel(DATADIR + data_filename)

print("Number of examples: ", len(data))
data.head(5)

Number of examples:  55104


Unnamed: 0,Headline,Short,Source,Time,Publish Date
0,4 ex-bank officials booked for cheating bank o...,The CBI on Saturday booked four former officia...,The New Indian Express,09:25:00,2017-03-26
1,Supreme Court to go paperless in 6 months: CJI,Chief Justice JS Khehar has said the Supreme C...,Outlook,22:18:00,2017-03-25
2,"At least 3 killed, 30 injured in blast in Sylh...","At least three people were killed, including a...",Hindustan Times,23:39:00,2017-03-25
3,Why has Reliance been barred from trading in f...,Mukesh Ambani-led Reliance Industries (RIL) wa...,Livemint,23:08:00,2017-03-25
4,Was stopped from entering my own studio at Tim...,TV news anchor Arnab Goswami has said he was t...,YouTube,23:24:00,2017-03-25


In [5]:
print(data['Headline'][50],'\n',data['Short'][50],'\n',data['Source '][50])

Censor Board demands removal of phrase Mann Ki Baat in film 
 Censor Board has demanded the removal of phrase &#39;Mann Ki Baat&#39; from a dialogue in film &#39;Sameer&#39;, as it is also the title of PM Narendra Modi&#39;s radio show. The Board hasn&#39;t objected to an expletive in the same dialogue. &#34;The board, despite granting us an A certificate, asked for certain scenes to be chopped mercilessly,&#34; said the film&#39;s director. 
 India Today


In [46]:
print(data['Headline'][38703])

L&amp;T Q1 net profit up 46% to ₹610 cr


In [47]:
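# replace_mapping and special_mapping are defined in the data cleaning section; run that cell first
s=replace_mapping(data['Headline'][38703], special_mapping)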
print(s)

L&T Q1 net profit up 46% to e610 cr


Removing the examples with null values or duplicates

In [49]:
# Remove duplicates based on the source text ('Short' column)
data.drop_duplicates(subset=["Short"],inplace=True)
#Remove rows containing null values
data.dropna(inplace=True)
#Recreate the dataframe index
data.reset_index(drop=True,inplace=True)
data.describe()

Unnamed: 0,Headline,Short,Source,Time,Publish Date
count,54997,54997,54997,54997,54997
unique,54939,54997,1471,1405,433
top,"Sensex, Nifty end on a flat note",Former Pakistan President Pervez Musharraf has...,YouTube,10:00:00,2016-10-11 00:00:00
freq,9,1,4685,127,234
first,,,,,2016-01-19 00:00:00
last,,,,,2017-03-26 00:00:00


In [24]:
# Check the number of rows, null values, etc
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54997 entries, 0 to 54996
Data columns (total 5 columns):
Headline        54997 non-null object
Short           54997 non-null object
Source          54997 non-null object
Time            54997 non-null object
Publish Date    54997 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 2.1+ MB


Now we remove all the columns that are not needed in our exercise. We keep only the `Headline` and `Short` columns, which become the `summary` and `text` variables respectively.

In [50]:
# We use the Headline column as the summary and the Short column as the source text
dataset = data[cols_to_use].copy()
dataset.columns = ['summary','text']
print(dataset.head(5))

                                             summary  \
0  4 ex-bank officials booked for cheating bank o...   
1     Supreme Court to go paperless in 6 months: CJI   
2  At least 3 killed, 30 injured in blast in Sylh...   
3  Why has Reliance been barred from trading in f...   
4  Was stopped from entering my own studio at Tim...   

                                                text  
0  The CBI on Saturday booked four former officia...  
1  Chief Justice JS Khehar has said the Supreme C...  
2  At least three people were killed, including a...  
3  Mukesh Ambani-led Reliance Industries (RIL) wa...  
4  TV news anchor Arnab Goswami has said he was t...  


In [26]:
print(dataset['summary'][50],'\n',dataset['text'][50],'\n')

Censor Board demands removal of phrase Mann Ki Baat in film 
 Censor Board has demanded the removal of phrase &#39;Mann Ki Baat&#39; from a dialogue in film &#39;Sameer&#39;, as it is also the title of PM Narendra Modi&#39;s radio show. The Board hasn&#39;t objected to an expletive in the same dialogue. &#34;The board, despite granting us an A certificate, asked for certain scenes to be chopped mercilessly,&#34; said the film&#39;s director. 



### Data cleaning and preprocessing

Now it is time to apply some functions for common text cleaning operations, such as:
- Remove URLs
- Remove HTML tags
- Remove some emojis
- Expand common contractions
- Expand some slang abbreviations
- Remove punctuation
- Remove non-character bytes (Unicode \xFF)
- Remove line breaks (\n)
- Replace HTML entities (e.g. &amp;)
- Remove mentions (@)
- Remove hashtags (#)

In [51]:
# Create a mapping to handle contractions
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", 
                       "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", 
                       "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", 
                       "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", 
                       "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                       "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", 
                       "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", 
                       "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                       "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", 
                       "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
                       "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", 
                       "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", 
                       "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", 
                       "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", 
                       "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", 
                       "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
                       "they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", 
                       "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", 
                       "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", 
                       "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  
                       "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", 
                       "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", 
                       "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", 
                       "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", 
                       "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                       "y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", 
                       "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

def expand_contractions(text, mapping):
    ''' Expand the contractions (some well-known of them) in a text'''
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text
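
The short sketch below exercises the same lookup logic on two sample sentences to show what it does and does not catch. The two-entry mapping and the sample sentences are illustrative assumptions; the function body mirrors the one above. Because the text is split on single spaces and each token is looked up verbatim, capitalized forms or tokens with trailing punctuation are left unexpanded.

```python
# Small subset of the mapping, for illustration only
mapping = {"don't": "do not", "it's": "it is"}

def expand_contractions(text, mapping):
    # Normalize curly quotes and accents to a plain apostrophe first
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    return ' '.join(mapping.get(t, t) for t in text.split(" "))

expanded = expand_contractions("I don’t think it's late", mapping)
missed = expand_contractions("Don't say you don't.", mapping)
print(expanded)
print(missed)
```

If those cases matter, lowercasing before the lookup or matching with a regular expression are possible refinements.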

In [52]:
special_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", 
                 "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 
                 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', 'Â':'', 'Ł':'',
                'Ă': '', '&#39;':"'", '&#34;':'"', '&amp;':'&'}

def remove_URL(text):
    ''' Remove URLs from the text'''
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

# Remove html tag
def remove_html(text):
    ''' Remove HTML tags from the text'''
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

def remove_mention(text):
    ''' Remove mentions from the text'''
    url = re.compile(r'@\S*')
    return url.sub(r'',text)

def remove_mult_spaces(text):
    ''' Reduce multispace to one single space'''
    re_mult_space = re.compile(r"  *") # replace multiple spaces with just one
    return re_mult_space.sub(r' ', text)

def remove_non_character(text):
    ''' Remove some non alphanumeric characters'''
    url = re.compile(r'\x89\S*|\x9b\S*|\x92\S*|\x93\S*|\x8a\S*|\x8f\S*|\x9d\S*|\x8c\S*|\x91\S*|\x87\S*|\x88\S*|\x82\S*')
    #url = re.compile(r'\x\d+\S*')
    return url.sub(r'',text)

def remove_punctuation(text, punctuation):
    ''' Remove punctuation from the text'''
    table=str.maketrans('','',punctuation)
    return text.translate(table)

def remove_emoji(text):
    ''' Remove emojis from the text'''
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_CTL(text):
    ''' Remove end of line from the text'''
    url = re.compile(r'\n')
    return url.sub(r' ',text)

def remove_hashtag(text):
    ''' Remove hashtags from the text'''
    url = re.compile(r'#\S*')
    return url.sub(r'',text)

# Strip accents (combining marks) from a unicode string
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')

def replace_mapping(text, mapping):
    for p in mapping:
        text = text.replace(p, mapping[p])

    return text
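
The `special_mapping` above decodes three HTML entities by hand (`&#39;`, `&#34;` and `&amp;`). As a possible alternative that this notebook does not use, the standard library's `html.unescape` decodes all named and numeric entities in one call. The sample string below is made up for illustration.

```python
import html

# Hypothetical headline containing the entities seen in the dataset
raw = "L&amp;T says &#39;Mann Ki Baat&#39; is &#34;popular&#34;"
decoded = html.unescape(raw)
print(decoded)
```

Running it before `replace_mapping` would leave the mapping responsible only for the symbol substitutions.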


We will create a function that calls all the other cleaning functions

In [53]:
def clean_text(text):
    ''' Clean the input text using common techniques'''
    new_text = text
    #new_text=new_text.apply(lambda x : unicode_to_ascii(x))
    new_text=new_text.apply(lambda x : replace_mapping(x, special_mapping))
    new_text=new_text.apply(lambda x : remove_URL(x))
    new_text=new_text.apply(lambda x : remove_html(x))
    new_text=new_text.apply(lambda x : remove_emoji(x))
    new_text=new_text.apply(lambda x : expand_contractions(x,contraction_mapping))
    new_text=new_text.apply(lambda x : remove_non_character(x))
    new_text=new_text.apply(lambda x : remove_CTL(x))
    new_text=new_text.apply(lambda x : remove_mention(x))
    new_text=new_text.apply(lambda x : remove_hashtag(x))
    #new_text=new_text.apply(lambda x : remove_punctuation(x, punctuation))
    new_text=new_text.apply(lambda x : remove_mult_spaces(x))
    return new_text
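
To make the chaining idea easier to test on a single string, here is a minimal sketch that composes a few cleaning steps with `functools.reduce`. The helper names and the shortened list of steps are assumptions for illustration; `clean_text` above remains the version applied to the dataframe columns.

```python
import re
from functools import reduce

def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_html(text):
    return re.sub(r'<.*?>', '', text)

def remove_mult_spaces(text):
    return re.sub(r' +', ' ', text)

# Order matters: whitespace is collapsed last, after other steps leave gaps
STEPS = [remove_url, remove_html, remove_mult_spaces]

def clean_string(text, steps=STEPS):
    ''' Apply each cleaning step in turn to a single string'''
    return reduce(lambda acc, step: step(acc), steps, text)

result = clean_string("Read <b>more</b> at https://example.com  now")
print(result)
```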


## Tokenize and apply more complex preprocessing tasks
The next step is to tokenize, that is, to split our text into tokens or words. *Tokenization: Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence*. These tokens will be our working unit. We can then apply some preprocessing steps, depending on the task:

   - Lowercase
   - Remove very short tokens
   - Remove stopwords
   - Stemming and lemmatization
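
The distinction between tokens and types can be seen in a small sketch. The regex-based `simple_tokenize` below is a hypothetical stand-in for NLTK's `word_tokenize` (it will not split text identically); it is used here only so the example runs without extra downloads.

```python
import re
from collections import Counter

def simple_tokenize(text, lowercase=True, min_length=0):
    ''' Split text into word and punctuation tokens (rough approximation)'''
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Keep only tokens longer than min_length characters
    return [t for t in tokens if len(t) > min_length]

tokens = simple_tokenize("The board asked. The board waited.")
types = Counter(tokens)   # one entry per distinct token (type)
print(tokens)
print(len(tokens), "tokens,", len(types), "types")

# With min_length=1 the single-character punctuation tokens are dropped
words_only = simple_tokenize("The board asked. The board waited.", min_length=1)
print(words_only)
```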


In [28]:
# Parameters to apply to the cleaning process

# If stopwords_file is defined then we take the stop words list from this file, if not we use the stopwords from NLTK
stopw= False
#stopwords_file='NLP_short_stopwords.txt'
stopwords_file=None
# Transform to lowercase
lowercase = True
# Set min length of a token
min_length = 0
# Set if we want to apply a lemmatizer or stemmer
lemmatize = False
stemming=False
# Remove punctuation
del_punct = False
# Remove digits
del_digits = False

### Stopwords

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms.
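
The collection-frequency strategy can be sketched in a few lines. The three-document corpus and the cut-off of two terms are made-up assumptions; on a real collection the candidate list would normally be reviewed by hand before use.

```python
from collections import Counter

# Hypothetical toy collection
docs = [
    "the court said the case is closed",
    "the bank said the loan is approved",
    "a court in the city is fining the bank",
]

# Collection frequency: total occurrences of each term across all documents
term_freq = Counter(word for doc in docs for word in doc.split())

# Take the most frequent terms as stopword candidates
stop_candidates = [term for term, _ in term_freq.most_common(2)]
print(term_freq.most_common(2))
print(stop_candidates)
```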

We can define our own stopword list or, if none is provided, use the NLTK stopwords for English.

In [65]:
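import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# Note: stopwords.words() also requires nltk.download('stopwords') to have been run once

if stopwords_file is not None:
    # Custom list: one stopword per line
    with open(stopwords_file) as f:
        stop_words = set(w.rstrip() for w in f)
else:
    stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    ''' Remove stopwords from the text, keeping word order and repeated words'''
    return ' '.join(w for w in text.split() if w not in stop_words)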

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\edumu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Lemmatizer and Stemmer

Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language. Stemming follows an algorithm with a fixed series of steps to perform on each word, which makes it faster. Lemmatization, in contrast, relies on the WordNet corpus to produce the lemma, which makes it slower than stemming. You also have to specify the part of speech to obtain the correct lemma.

We will use a WordNetLemmatizer and a PorterStemmer, both of which are common choices.
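
To make the difference concrete without downloading any corpus, the sketch below contrasts a toy suffix-stripping stemmer with a toy dictionary-based lemmatizer. Both functions and the tiny lemma table are hypothetical simplifications; they are not the Porter algorithm or WordNet, and they only illustrate why a stem may not be a real word while a lemma is.

```python
def toy_stem(word):
    ''' Strip the first matching suffix, keeping at least 3 characters'''
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a lexical database
TOY_LEMMAS = {"studies": "study", "studying": "study", "better": "good"}

def toy_lemmatize(word):
    ''' Return the dictionary form if known, else the word unchanged'''
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "studying", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Here the rule-based stemmer turns "studies" into "stud", which is not a word, while the lookup returns "study".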

In [20]:
# Define Lemmatizer object
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Define the Porter Stemmer object
from nltk.stem import PorterStemmer
wordnet_stemmer = PorterStemmer()

### Define the tokenizer

In [21]:
def my_tokenizer(s):
    ''' Apply the tokenization method to the input string s.
        Based on the setup, we lowercase, remove punctuation, remove digits,
        remove stopwords, apply stemming, apply lemmatization and remove tokens
        whose length is not above the minimum defined.
    '''
    if lowercase:
        s = s.lower() # downcase
    if del_punct:
        s=remove_punctuation(s, punctuation) # Remove punctuation

    if del_digits:
        s=re.sub(r'\d+', '', s)  # Remove digits (no remove_digits helper is defined in this notebook)

    tokens = word_tokenize(s) # split string into words (tokens)

    tokens = [t for t in tokens if len(t) > min_length] # remove short words, they're probably not useful

    if stopw:
        tokens = [t for t in tokens if t not in stop_words] # remove stopwords

    if stemming:
        tokens = [wordnet_stemmer.stem(t) for t in tokens]  # Apply stemming 
        
    if lemmatize:
        tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] # Lemmatize, put words into base form
        
    return tokens


In [29]:
len(dataset['text']),len(dataset['summary'] )

(54997, 54997)

Now we can start the cleaning process on our dataset

In [54]:
# Apply the cleaning functions to the source text and summary
data_text = clean_text(dataset['text'])
data_headlines = clean_text(dataset['summary'])

In [55]:
#Checking the results
len(data_text), len(data_headlines)

(54997, 54997)

In [32]:
data_text[50]

'Censor Board has demanded the removal of phrase \'Mann Ki Baat\' from a dialogue in film \'Sameer\', as it is also the title of PM Narendra Modi\'s radio show. The Board has not objected to an expletive in the same dialogue. "The board, despite granting us an A certificate, asked for certain scenes to be chopped mercilessly," said the film\'s director.'

In [56]:
# Save the cleaned text in the dataframe
dataset['text']=data_text
dataset['summary']=data_headlines

In [68]:
# Apply cleaning based on the tokenizer defined and parameters
def tokenize_and_string(sentences):
    ''' Tokenize the sentences, apply more cleaning steps
        Input: 
        - sentences: list of string, sentences of the text
        Output:
        - sentence2token: list of tokens or words, the tokens in every sentence
        - token2sent: list of string, the list of sentences
    '''
    # Tokenize and clean every text 
    sentence2token = [my_tokenizer(sent) for sent in sentences]
    # Convert tokens to a string
    token2sent= [' '.join(sent) for sent in sentence2token]
    
    return sentence2token,token2sent

In [69]:
# Tokenize the text in our dataset
_,cleaned_text=tokenize_and_string(data_text)
_,cleaned_headlines=tokenize_and_string(data_headlines)

Let's show an example with our original text and the cleaned one

In [70]:
dataset['text'][50], cleaned_text[50]

('Censor Board has demanded the removal of phrase &#39;Mann Ki Baat&#39; from a dialogue in film &#39;Sameer&#39;, as it is also the title of PM Narendra Modi&#39;s radio show. The Board hasn&#39;t objected to an expletive in the same dialogue. &#34;The board, despite granting us an A certificate, asked for certain scenes to be chopped mercilessly,&#34; said the film&#39;s director.',
 "censor board has demanded the removal of phrase 'mann ki baat ' from a dialogue in film 'sameer ' , as it is also the title of pm narendra modi 's radio show . the board has not objected to an expletive in the same dialogue . & board , despite granting us an a certificate , asked for certain scenes to be chopped mercilessly , & said the film 's director .")

In [71]:
# Save the cleaned and tokenized text in the dataframe
dataset['text']=cleaned_text
dataset['summary']=cleaned_headlines

## Save the preprocessed and cleaned data

We need to create one file for training our model and a different file for evaluation purposes. This is very important in machine learning, because it simulates as closely as possible the production environment, where new and unseen examples will be fed to our algorithm.

The training dataset will contain 85% of the examples and the remaining 15% will go into the validation dataset.
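
The shuffle-then-slice idea behind the split can be sketched with the standard library alone. The list of 20 placeholder examples is an assumption for illustration; the notebook itself shuffles the dataframe with `sample(frac=1, random_state=42)` and slices it with `iloc`.

```python
import random

examples = [f"example_{i}" for i in range(20)]  # hypothetical stand-in rows
train_pct = 0.85

# Shuffle a copy with a fixed seed so the split is reproducible
rng = random.Random(42)
shuffled = examples[:]
rng.shuffle(shuffled)

# The first 85% goes to training, the rest to validation
training_size = int(len(shuffled) * train_pct)
train, valid = shuffled[:training_size], shuffled[training_size:]
print(len(train), len(valid))
```

Shuffling before slicing matters when the source file is ordered, for instance by publish date.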

In [57]:
# Set the percentage of rows to keep in the training dataset
train_pct=0.85
# Shuffle the whole dataframe
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)
# Define the size of the training dataset (based on the cleaned dataset)
training_size = int(len(dataset)*train_pct)
print('Training size: ', training_size)

# Create a file with the whole dataset cleaned
dataset.to_csv(DATADIR +'cl_'+data_filename, encoding='utf-8', index=False)
#Create a file with the training dataset
dataset.iloc[:training_size,:].to_csv(DATADIR +'cl_train_'+data_filename, encoding='utf-8',index=False)
# Create a file with the validation dataset
dataset.iloc[training_size:,:].to_csv(DATADIR +'cl_valid_'+data_filename, encoding='utf-8', index=False)

Training size:  46747


Save the training and validation datasets to Google Cloud Storage

In [112]:
%%bash
gsutil cp data/cl_news_summary_train.csv gs://mlend_text_summarization/data/news_summary/
gsutil cp data/cl_news_summary_valid.csv gs://mlend_text_summarization/data/news_summary/
gsutil cp data/cl_news_summary_more.csv gs://mlend_text_summarization/data/news_summary/

CommandException: No URLs matched: data/cl_news_summary_train.csv
CommandException: No URLs matched: data/cl_news_summary_valid.csv
CommandException: No URLs matched: data/cl_news_summary_more.csv


CalledProcessError: Command 'b'gsutil cp data/cl_news_summary_train.csv gs://mlend_text_summarization/data/news_summary/\ngsutil cp data/cl_news_summary_valid.csv gs://mlend_text_summarization/data/news_summary/\ngsutil cp data/cl_news_summary_more.csv gs://mlend_text_summarization/data/news_summary/\n'' returned non-zero exit status 1.

### Checking the files created

In [24]:
train_df=pd.read_csv("data/news_summary_train.csv")
print('Train Length: ',len(train_df))
train_df.head(20)

Train Length:  83606


Unnamed: 0,headlines,text
0,Paytm raises $1.4 billion from SoftBank in lar...,Digital payments startup Paytm has raised $1.4...
1,Petrol price cut by ₹1.12 per litre as daily...,Oil companies on Thursday reduced the petrol p...
2,Army plans to deploy women officers for cyber ...,The Indian Army has announced plans to deploy ...
3,Uday Chopra confirms YRF will produce Jessica ...,Yash Raj Films CEO Uday Chopra has confirmed t...
4,Mulayam Yadav to contest 2019 polls from Mainp...,Senior Samajwadi Party leader Ram Gopal Yadav ...
5,I am so excited to play under Virat Kohli's ca...,"Batsman Shubman Gill, who has been included in..."
6,Twitter reacts to women-only screening of Wond...,Reacting to Texas' The Alamo Drafthouse theatr...
7,Apple to hire engineer with psychology backgro...,Apple is hiring a software engineer with psych...
8,Railway Board Chairman resigns after 2 derailm...,"The Chairman of the Railway Board, Ashok Mitta..."
9,Who is billionaire Radhakishan Damani?,Radhakishan Damani is the 61-year-old billiona...


In [25]:
valid_df=pd.read_csv("data/news_summary_valid.csv")
print('Valid Length: ',len(valid_df))
valid_df.head(20)

Valid Length:  14754


Unnamed: 0,headlines,text
0,"Govt forms SIT in Ryan murder case, CBSE seeks...",The HRD Ministry has formed a three-member Spe...
1,"Indrani asks for furniture, jewellery in divor...","In a letter written from jail, Sheena Bora mur..."
2,"ED raids 35 premises of Nirav Modi, ₹550-cr ...",The Enforcement Directorate (ED) on Friday con...
3,Japan admits 1st death from 2011 Fukushima nuc...,Japan has acknowledged for the first time that...
4,An entire village in Germany is being auctione...,An entire village in Germany is being auctione...
5,"Man Utd luckiest PL team, Liverpool unluckiest...",Manchester United were the luckiest team while...
6,Raina overtakes Kohli to become IPL's top run-...,CSK's Suresh Raina overtook RCB captain Virat ...
7,Mumbai housing society saves ₹2 lakh a month...,Residents of a 20-storey housing complex in Ka...
8,NASA launches two Antarctic flights from two c...,For the first time in its nine years of operat...
9,Pentagon slammed for wasting ₹180 cr on Afgh...,US Defence Secretary James Mattis has criticis...


In [12]:
train_df['text'][0]

"Digital payments startup Paytm has raised $1.4 billion from SoftBank in India's largest funding round. This is also SoftBank's biggest investment in the Indian startup ecosystem till date. The latest investment by SoftBank will value Paytm at around $8 billion, up from its valuation of $4.8 billion in August 2016. "