# Text Generation task: Generate product names from description

# Data Cleaning and Preprocessing


# Importing the libraries

In [None]:
# Import utils and text processing libraries
import pandas as pd
import re
import string
import os, shutil
import unicodedata

# Import library to split our dataset
from sklearn.model_selection import train_test_split

In [None]:
# Check the working directory
!cd

C:\Users\edumu\Google Drive\Projects\SpainAI NLP


We are going to remove punctuation, but we will keep the full stop (`.`), which is needed as a sentence delimiter.

In [None]:
punctuation = string.punctuation.replace('.', '')
print(punctuation)

!"#$%&'()*+,-/:;<=>?@[\]^_`{|}~
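As a quick check of how this reduced punctuation set behaves, the sketch below builds a translation table with `str.maketrans` and applies it to one sentence. The sample sentence is made up for illustration.

```python
import string

# Same idea as above: every punctuation mark except the full stop
punctuation = string.punctuation.replace('.', '')

# Translation table that deletes each character listed in `punctuation`
table = str.maketrans('', '', punctuation)

sample = "Loose-fitting dress, round neckline. Model height: 177 cm."
stripped = sample.translate(table)
print(stripped)
```

Note that hyphenated words are merged (`Loose-fitting` becomes `Loosefitting`), while the full stops survive.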


First, we set some global variables, including the data path and filenames.

In [None]:
# RUN THIS CELL when running outside Colab
root_folder = "C:/Users/edumu/Google Drive"

In [None]:
# RUN THIS CELL When running ON Colab 
root_folder = '/content/drive/My Drive'
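
Instead of running one of the two cells by hand, the root folder could be chosen automatically. The sketch below uses a common heuristic (checking whether the `google.colab` package is loaded); `pick_root_folder` is a hypothetical helper, not part of the original notebook, and it assumes Google Drive is already mounted when running on Colab.

```python
import sys

def pick_root_folder(local_root, colab_root='/content/drive/My Drive'):
    """Return the Colab Drive path when running on Colab, else the local path."""
    # Heuristic: the google.colab package is normally loaded only in a Colab runtime
    on_colab = 'google.colab' in sys.modules
    return colab_root if on_colab else local_root

root_folder = pick_root_folder("C:/Users/edumu/Google Drive")
print(root_folder)
```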

In [None]:
# Set the data path and filename of our dataset
data_folder_name='datasets/text_gen_product_names'
train_filename= 'train.csv'
test_filename= 'test_descriptions.csv'

data_path = os.path.abspath(os.path.join(root_folder, data_folder_name))
train_filenamepath = os.path.abspath(os.path.join(data_path, train_filename))
test_filenamepath = os.path.abspath(os.path.join(data_path, test_filename))

print('Filename: ',train_filenamepath)
datafile_type=os.path.splitext(train_filename)[1]
# Depending on the datafile we set the names of the columns to use
cols_to_use = ['name','description']
# Set the environment variable DATA in case we need to copy the file from GCS
os.environ['DATA'] = data_path

Filename:  C:\Users\edumu\Google Drive\datasets\SpainAI NLP\train.csv


### Loading the dataset

We check if the datafile is a csv or excel file and then we load the file into a Pandas Dataframe. 

In [None]:
# Read the csv or excel file into a dataframe
if datafile_type == '.csv':
    data = pd.read_csv(train_filenamepath, encoding='utf-8')
elif datafile_type in ('.xls', '.xlsx'):
    # read_excel takes no encoding argument in current pandas; reuse the path built above
    data = pd.read_excel(train_filenamepath)
else:
    raise ValueError(f'Unsupported file type: {datafile_type}')
# Show the number of examples and the first 5 records
print("Number of examples: ", len(data))
data.head(5)

Number of examples:  33613


Unnamed: 0,name,description
0,CROPPED JACKET TRF,Jacket made of a technical fabric with texture...
1,OVERSIZED SHIRT WITH POCKET TRF,Oversized long sleeve shirt with a round colla...
2,TECHNICAL TROUSERS TRF,High-waist trousers with a matching elastic wa...
3,SHIRT DRESS,Collared dress featuring sleeves falling below...
4,PUFF SLEEVE DRESS WITH PLEATS TRF,Loose-fitting midi dress with a round neckline...


In [None]:
print(data['name'][50],'\n',data['description'][50],'\n')

SLEEVELESS KNIT TOP 
 Sleeveless knit top with a high neck. <br/><br/>HEIGHT OF MODEL: 177 cm. / 5′ 9″ 



Now we remove all the columns that are not needed in our exercise. We keep only the `name` and `description` columns: `description` is the source text and `name` is the target text to generate.

In [None]:
# We use the description as the source text and the name as the target text
dataset = data[cols_to_use].copy()
print(dataset.head(5))

                                name  \
0                 CROPPED JACKET TRF   
1    OVERSIZED SHIRT WITH POCKET TRF   
2             TECHNICAL TROUSERS TRF   
3                        SHIRT DRESS   
4  PUFF SLEEVE DRESS WITH PLEATS TRF   

                                         description  
0  Jacket made of a technical fabric with texture...  
1  Oversized long sleeve shirt with a round colla...  
2  High-waist trousers with a matching elastic wa...  
3  Collared dress featuring sleeves falling below...  
4  Loose-fitting midi dress with a round neckline...  


In [None]:
# Check the number of rows, null values, etc
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33613 entries, 0 to 33612
Data columns (total 2 columns):
name           33613 non-null object
description    33613 non-null object
dtypes: object(2)
memory usage: 525.3+ KB


Now that the text data is loaded in a dataframe we can check some basic descriptive statistics:

In [None]:
dataset.describe()

Unnamed: 0,name,description
count,33613,33613
unique,23428,31593
top,PRINTED DRESS,Short sleeve T-shirt with a round neckline and...
freq,61,48


Our feature field is the variable `description`, our source text. `describe()` reports 31,593 unique descriptions out of 33,613 rows, so this field contains duplicate values. Our dataset contains only one feature and one label, so any row with a null value must be deleted.

Removing the examples with null values. The line that drops duplicate descriptions is left commented out in the cell below, so duplicates are kept.

In [None]:
# Remove duplicates on the description (currently disabled)
#dataset.drop_duplicates(subset=["description"],inplace=True)
#Remove rows containing null values
dataset.dropna(inplace=True)
#Recreate the dataframe index
dataset.reset_index(drop=True,inplace=True)
dataset.describe()

Unnamed: 0,name,description
count,33613,33613
unique,23428,31593
top,PRINTED DRESS,Short sleeve T-shirt with a round neckline and...
freq,61,48


In [None]:
print(dataset['name'][50],'\n',dataset['description'][50],'\n')

SLEEVELESS KNIT TOP 
 Sleeveless knit top with a high neck. <br/><br/>HEIGHT OF MODEL: 177 cm. / 5′ 9″ 



### Data cleaning and preprocessing

Now it is time to apply some functions for common text cleaning operations like:

- Remove URLs
- Remove html tags
- Remove some emojis
- Expand common contractions
- Replace some special symbols
- Remove punctuation
- Remove non-character bytes (e.g. Unicode \xFF)
- Remove line breaks (\n)
- Remove consecutive blanks
- Remove mentions (@)
- Remove hashtags (#)

In [None]:
# Create a mapping to handle contractions
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", 
                       "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", 
                       "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", 
                       "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", 
                       "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                       "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", 
                       "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", 
                       "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                       "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", 
                       "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
                       "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", 
                       "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", 
                       "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", 
                       "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", 
                       "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", 
                       "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
                       "they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", 
                       "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", 
                       "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", 
                       "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  
                       "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", 
                       "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", 
                       "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", 
                       "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", 
                       "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                       "y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", 
                       "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

def expand_contractions(text, mapping):
    ''' Expand the contractions (some well-known of them) in a text'''
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text
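
A short demonstration of what this function does and does not catch. Because it splits on single spaces and looks up each piece verbatim, a contraction is only expanded when it matches a key exactly: capitalised forms and tokens with attached punctuation pass through unchanged. The three-entry mapping below is a hypothetical stand-in for the full dictionary.

```python
# Hypothetical three-entry mapping, just for this demonstration
demo_mapping = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def expand_contractions(text, mapping):
    """Same logic as above: normalise apostrophes, then replace whole tokens."""
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    return ' '.join(mapping.get(t, t) for t in text.split(" "))

expanded = expand_contractions("it’s soft and don't shrink", demo_mapping)
untouched = expand_contractions("Don't iron, can't, tumble dry", demo_mapping)
print(expanded)
print(untouched)
```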

In [None]:
special_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", 
                 "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 
                 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', 'Â':'', 'Ł':'',
                'Ă': '', '&#39;':"'", '&#34;':'"', '&amp;':'&'}

def remove_URL(text):
    ''' Remove URLs from the text'''
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

# Remove html tag
def remove_html(text):
    ''' Remove HTML tags from the text'''
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

def remove_mention(text):
    ''' Remove mentions from the text'''
    url = re.compile(r'@\S*')
    return url.sub(r'',text)

def remove_mult_spaces(text):
    ''' Reduce multispace to one single space'''
    re_mult_space = re.compile(r"  *") # replace multiple spaces with just one
    return re_mult_space.sub(r' ', text)

def remove_non_character(text):
    ''' Remove some non alphanumeric characters'''
    url = re.compile(r'\x89\S*|\x9b\S*|\x92\S*|\x93\S*|\x8a\S*|\x8f\S*|\x9d\S*|\x8c\S*|\x91\S*|\x87\S*|\x88\S*|\x82\S*')
    #url = re.compile(r'\x\d+\S*')
    return url.sub(r'',text)

def remove_punctuation(text, punctuation):
    ''' Remove punctuation from the text'''
    table=str.maketrans('','',punctuation)
    return text.translate(table)

def remove_emoji(text):
    ''' Remove emojis from the text'''
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_CTL(text):
    ''' Remove end of line from the text'''
    url = re.compile(r'\n')
    return url.sub(r' ',text)

def remove_hashtag(text):
    ''' Remove hashtags from the text'''
    url = re.compile(r'#\S*')
    return url.sub(r'',text)

def unicode_to_ascii(s):
    ''' Transform unicode symbols to ascii'''
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                    if unicodedata.category(c) != 'Mn')

def replace_mapping(text, mapping):
    ''' Replace some special characters with an appropriate substitute'''
    for p in mapping:
        text = text.replace(p, mapping[p])

    return text
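
One behaviour of `remove_html` is worth keeping in mind: tags are replaced with an empty string, so two sentences separated only by `<br/>` tags end up glued together. The sketch below reproduces this on a made-up description and shows one possible alternative; `replace_html_with_space` is a hypothetical variant, not a function used in this notebook.

```python
import re

def remove_html(text):
    """Same pattern as above: delete anything between < and >."""
    return re.sub(r'<.*?>', '', text)

def replace_html_with_space(text):
    """Hypothetical variant: swap each tag for a space, then squeeze spaces."""
    return re.sub(r' +', ' ', re.sub(r'<.*?>', ' ', text)).strip()

raw = "Nautical cap with peak.<br/><br/>This item must be returned."
glued = remove_html(raw)
spaced = replace_html_with_space(raw)
print(glued)
print(spaced)
```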


We will create a function that will apply all the cleaning functions to a pandas column, using the apply method.  

In [None]:
def clean_text(text):
    ''' Clean a pandas Series of strings using common techniques'''
    new_text = text
    #new_text=new_text.apply(lambda x : unicode_to_ascii(x))
    new_text=new_text.apply(lambda x : replace_mapping(x, special_mapping))
    new_text=new_text.apply(lambda x : remove_URL(x))
    new_text=new_text.apply(lambda x : remove_html(x))
    new_text=new_text.apply(lambda x : remove_emoji(x))
    new_text=new_text.apply(lambda x : remove_CTL(x))
    new_text=new_text.apply(lambda x : expand_contractions(x,contraction_mapping))
    new_text=new_text.apply(lambda x : remove_non_character(x))
    new_text=new_text.apply(lambda x : remove_mention(x))
    new_text=new_text.apply(lambda x : remove_hashtag(x))
    #new_text=new_text.apply(lambda x : remove_punctuation(x, punctuation))
    new_text=new_text.apply(lambda x : remove_mult_spaces(x))
    return new_text
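
The same chaining idea can also be written for a single string, which makes the order of the steps explicit and easy to test. This is only a sketch: the three step functions are simplified stand-ins, and `clean_one` and `STEPS` are hypothetical names.

```python
import re
from functools import reduce

# Stand-in cleaning steps; the notebook's own functions could be listed here instead
def strip_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def strip_tags(text):
    return re.sub(r'<.*?>', '', text)

def squeeze_spaces(text):
    return re.sub(r' +', ' ', text)

STEPS = [strip_urls, strip_tags, squeeze_spaces]

def clean_one(text, steps=STEPS):
    """Run every step, in order, on a single string."""
    return reduce(lambda acc, step: step(acc), steps, text)

result = clean_one("Knit top <b>new</b>   see www.example.com now")
print(result)
```

With pandas, a function like this could be passed once to `Series.apply`, so the column is traversed a single time instead of once per step.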


## Tokenize and apply more complex preprocessing tasks
The next step is to tokenize, that is, to split our text into tokens or words.

**Tokenization**: *Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence.*
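
To make the difference between tokens and types concrete, here is a minimal sketch that uses whitespace splitting as the simplest possible tokenizer. The sentence is made up for illustration.

```python
text = "round neckline and round collar and long sleeves"

# Whitespace splitting is the simplest possible tokenizer
tokens = text.split()
# A type is a distinct character sequence, so repeated tokens collapse
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
```

The sentence has 8 tokens but only 6 types, because `round` and `and` each appear twice.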

These tokens will be our working unit. Then we can apply some preprocessing steps depending on the task: 

   - Lowercase
   - Remove very short tokens
   - Remove stop-words
   - Apply stemming and lemmatization


In [None]:
# Parameters to apply to the cleaning process
# If stopwords_file is defined then we take the stop words list from this file, if not we use the stopwords from NLTK
stopw= True
#stopwords_file='NLP_short_stopwords.txt'
stopwords_file=None
# Transform to lowercase
lowercase = True
# Minimum token length: only tokens longer than this value are kept
min_length = 2
# Set if we want to apply a lemmatizer or stemmer
lemmatize = False
stemming=False
# Remove punctuation
del_punct = True
# Remove digits
del_digits = False

### Stopwords

Sometimes, extremely common words that appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection) and then take the most frequent terms. Alternatively, you can use one of the predefined lists included in NLP libraries such as NLTK.
In my experience, and depending on the problem, some of these stop word lists are too extensive and remove too many words. Sometimes you can create a specific list of words, based on the language of your problem.

However, removing stop words is not a good choice in natural language generation tasks: these words carry little meaning on their own, but they are essential for generating understandable and grammatically correct text. In text classification problems, on the other hand, we can ignore the stop words to force the algorithm to focus on more relevant words.
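
The frequency-based strategy described above can be sketched with `collections.Counter`. The tiny corpus and the cut-off `k` are made up for illustration; on a real collection the cut-off would need to be chosen by inspecting the ranked list.

```python
from collections import Counter

# Tiny made-up corpus standing in for the product descriptions
docs = [
    "dress with a round neckline and long sleeves",
    "shirt with a collar and long sleeves",
    "trousers with a matching elastic waistband",
]

# Collection frequency: total occurrences of each term across all documents
freq = Counter(token for doc in docs for token in doc.split())

# Take the top-k terms as candidate stop words (k is an arbitrary choice here)
k = 2
candidates = [term for term, _ in freq.most_common(k)]
print(candidates)
```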

We can define our own stop word list or, if we do not, use the NLTK stop words for English.

In [None]:
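import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# The NLTK stop word list is a separate resource from punkt
nltk.download('stopwords')

if stopwords_file is not None:
    # The file is expected to hold one stop word per line
    with open(stopwords_file) as f:
        stop_words = set(w.rstrip() for w in f)
else:
    stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    ''' Remove stopwords from the text, keeping the original word order'''
    return ' '.join(w for w in text.split() if w not in stop_words)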

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\edumu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Lemmatizer and Stemmer

Stemming and lemmatization are techniques "*to group together the inflected forms of a word so they can be analysed as a single item*" (Wikipedia). Words with a similar root and meaning are then treated as a single word, which simplifies the learning process.

Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language. Stemming applies a fixed sequence of rules to each word, which makes it faster. Lemmatization relies on a lexical resource, here the WordNet corpus, to produce the lemma, which makes it slower than stemming. You also need to supply the part of speech to obtain the correct lemma.
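
The difference can be illustrated with a deliberately crude toy. This is not the Porter algorithm and not WordNet: `toy_stem` applies three made-up suffix rules and `TOY_LEMMAS` is a hypothetical lookup table standing in for a dictionary.

```python
# Toy illustration only: this is NOT the Porter algorithm or WordNet
def toy_stem(word):
    """Strip a few common suffixes by rule; the result may not be a real word."""
    for suffix, repl in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + repl
    return word

# Hypothetical lookup table standing in for a dictionary such as WordNet
TOY_LEMMAS = {"studies": "study", "sleeves": "sleeve", "featuring": "feature"}

def toy_lemma(word):
    """Look the word up; fall back to the word itself when unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "sleeves", "featuring"]:
    print(w, toy_stem(w), toy_lemma(w))
```

The rule-based version is cheap but yields non-words such as `studi` and `featur`, while the lookup returns real words at the cost of needing a dictionary.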

We will use a WordNetLemmatizer and a PorterStemmer; both are common choices.

In [None]:
# Define Lemmatizer object
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Define the Porter Stemmer object
from nltk.stem import PorterStemmer
wordnet_stemmer = PorterStemmer()

### Define the tokenizer

In [None]:
def my_tokenizer(s):
    ''' Apply the tokenization method to the input string s.
        Based on the setup, we lowercase, remove punctuation, remove digits,
        remove stopwords, apply stemming, apply lemmatization and remove tokens
        whose length does not exceed the defined minimum.
    '''
    if lowercase:
        s = s.lower() # downcase
    if del_punct:
        s = remove_punctuation(s, punctuation) # Remove punctuation

    if del_digits:
        s = re.sub(r'\d+', '', s)  # Remove digits

    tokens = word_tokenize(s) # split string into words (tokens)

    tokens = [t for t in tokens if len(t) > min_length or t in string.punctuation] # drop tokens of min_length characters or fewer, except punctuation

    if stopw:
        tokens = [t for t in tokens if t not in stop_words] # remove stopwords

    if stemming:
        tokens = [wordnet_stemmer.stem(t) for t in tokens]  # Apply stemming 
        
    if lemmatize:
        tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] # Lemmatize, put words into base form
        
    return tokens


In [None]:
len(dataset['description']),len(dataset['name'] )

(33613, 33613)

Now we can start the cleaning process on our dataset

In [None]:
# Apply the cleaning functions to the description (source) and the name (target)
data_text = clean_text(dataset['description'])
data_headlines = clean_text(dataset['name'])

In [None]:
#Checking the results
len(data_text), len(data_headlines)

(33613, 33613)

In [None]:
data_text[50]

'Sleeveless knit top with a high neck. HEIGHT OF MODEL: 177 cm. / 5′ 9″'

In [None]:
# Save the cleaned text in the dataframe
dataset['description']=data_text
dataset['name']=data_headlines

In [None]:
# Apply cleaning based on the tokenizer defined and parameters
def tokenize_and_string(sentences):
    ''' Tokenize the sentences and apply more cleaning steps
        Input: 
        - sentences: iterable of strings, the texts to tokenize
        Output:
        - sentence2token: list of lists of tokens, the tokens in every sentence
        - token2sent: list of strings, each sentence rebuilt from its tokens
    '''
    # Tokenize and clean every text 
    sentence2token = [my_tokenizer(sent) for sent in sentences]
    # Convert tokens to a string
    token2sent= [' '.join(sent) for sent in sentence2token]
    
    return sentence2token,token2sent

In [None]:
# Tokenize the text in our dataset
_,cleaned_text=tokenize_and_string(data_text)
_,cleaned_headlines=tokenize_and_string(data_headlines)

Let's show an example with our original text and the cleaned one

In [None]:
dataset['description'][50], cleaned_text[50]

('Sleeveless knit top with a high neck. HEIGHT OF MODEL: 177 cm. / 5′ 9″',
 'sleeveless knit top high neck . height model 177 .')
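
To see why `cm`, `of` and the imperial height disappear, the sketch below traces the same filters on this example without NLTK. It is an approximation under stated assumptions: a regular expression stands in for `word_tokenize`, the stop list is a hypothetical three-word set instead of NLTK's English list, and `sketch_tokenizer` is not part of the notebook.

```python
import re
import string

punctuation = string.punctuation.replace('.', '')
min_length = 2
# Hypothetical mini stop list; the notebook uses NLTK's English list
stop_words = {"with", "the", "and"}

def sketch_tokenizer(s):
    s = s.lower()
    s = s.translate(str.maketrans('', '', punctuation))
    # Regex stand-in for word_tokenize: runs of word characters, or single symbols
    tokens = re.findall(r"\w+|[^\w\s]", s)
    # Keep tokens longer than min_length, plus ASCII punctuation such as '.'
    tokens = [t for t in tokens if len(t) > min_length or t in string.punctuation]
    return [t for t in tokens if t not in stop_words]

text = 'Sleeveless knit top with a high neck. HEIGHT OF MODEL: 177 cm. / 5′ 9″'
traced = ' '.join(sketch_tokenizer(text))
print(traced)
```

Every token of one or two characters (`a`, `of`, `cm`, `5`, `9`) is dropped by the length filter, and the prime symbols are not in `string.punctuation`, so they are dropped too.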

In [None]:
# Save the cleaned and tokenized text in the dataframe
dataset['description']=cleaned_text
dataset['name']=cleaned_headlines

## Save the preprocessed and cleaned data

We need to create one file for training our model and a different file for evaluation purposes. This is very important in machine learning, because it simulates as closely as possible the production environment, where new and unseen examples will be fed to our algorithm.

The training dataset will contain 85% of the examples and the remaining 15% will go into the validation dataset.

In [None]:
# Set the percentage of rows to keep in the training dataset
train_pct=0.85
# Shuffle the whole dataframe
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)
# Define the size of the training dataset (based on the cleaned dataframe)
training_size = int(len(dataset)*train_pct)
print('Training size: ', training_size)

# Base name of the source file; os.path.join keeps the paths portable
base_name = os.path.splitext(train_filename)[0]
# Create a file with the whole dataset cleaned
dataset.to_csv(os.path.join(data_path, 'cl_stopw_'+base_name+'.csv'), encoding='utf-8', index=False)
# Create a file with the training dataset
dataset.iloc[:training_size,:].to_csv(os.path.join(data_path, 'cl_stopw_train_'+base_name+'.csv'),
                                      encoding='utf-8', index=False)
# Create a file with the validation dataset
dataset.iloc[training_size:,:].to_csv(os.path.join(data_path, 'cl_stopw_valid_'+base_name+'.csv'),
                                      encoding='utf-8', index=False)

Training size:  28571
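
The arithmetic behind the split can be checked without pandas. The sketch below shuffles row positions with the standard library and cuts them at the same point; it illustrates the technique only and will not reproduce the exact row order produced by `DataFrame.sample`.

```python
import random

n_examples = 33613          # number of rows reported above
train_pct = 0.85

# Shuffle row positions with a fixed seed so the split is reproducible
indices = list(range(n_examples))
random.Random(42).shuffle(indices)

# int() truncates 28571.05 down to 28571
training_size = int(n_examples * train_pct)
train_idx = indices[:training_size]
valid_idx = indices[training_size:]

print(len(train_idx), len(valid_idx))
```

This gives 28,571 training rows and 5,042 validation rows, with no row in both sets.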


Save the train and validation dataset to GS

# Preprocess the test file

In [None]:
# Read the csv or excel file into a dataframe
datafile_type = os.path.splitext(test_filename)[1]
if datafile_type == '.csv':
    data = pd.read_csv(test_filenamepath)
elif datafile_type in ('.xls', '.xlsx'):
    # read_excel takes no encoding argument in current pandas; reuse the test path built above
    data = pd.read_excel(test_filenamepath)
# Show the number of examples and the first 5 records
print("Number of examples: ", len(data))
print('Num Examples: ',len(data))
print('Null Values\n', data.isna().sum())
data.head(5)

Number of examples:  1441
Num Examples:  1441
Null Values
 description    0
dtype: int64


Unnamed: 0,description
0,"Knit midi dress with a V-neckline, straps and ..."
1,"Loose-fitting dress with a round neckline, lon..."
2,Nautical cap with peak.<br/><br/>This item mus...
3,Nautical cap with peak. Adjustable inner strap...
4,Nautical cap with side button detail.<br/><br/...


In [None]:
# Apply the cleaning functions to the descriptions
data_text = clean_text(data['description'])

In [None]:
data_text

0       Knit midi dress with a V-neckline, straps and ...
1       Loose-fitting dress with a round neckline, lon...
2       Nautical cap with peak.This item must be retur...
3       Nautical cap with peak. Adjustable inner strap...
4       Nautical cap with side button detail.This item...
                              ...                        
1436    Striped print cotton cushion cover. Cushion fi...
1437         Rectangular cushion featuring a gnome print.
1438    Cotton jersey eye mask featuring an elastic ba...
1439    Padded chipboard hanger featuring an iron hook...
1440    Iron hanger suitable for hanging all kinds of ...
Name: description, Length: 1441, dtype: object

In [None]:
# Tokenize the text in our dataset
_,cleaned_text=tokenize_and_string(data_text)

In [None]:
cleaned_text

['knit midi dress vneckline straps matching lace detail.height model 177 . 69.6″',
 'loosefitting dress round neckline long sleeves pleat details buttoned opening back.height model 177 69.6″',
 'nautical cap peak.this item must returned original cardboard packaging intact .',
 'nautical cap peak . adjustable inner strap detail .',
 'nautical cap side button detail.this item must returned original cardboard packaging intact .',
 'faded short sleeve tshirt round neckline front print.due dyeing process print tshirt unique may differ shown photo.height model 177 . 69.6″',
 'coat round collar long sleeves . featuring front welt pockets faux suede interior button fastening front . height model 177 . 69.6″',
 'ripped tshirt . round neck short sleevesheight model 176',
 'fitted top made polyamide blend . features wide straps chest reinforcement.model height 178',
 'fitted top made polyamide blend . features wide straps chest reinforcement.model height 177 69.6″',
 'fitted top made polyamide bl

In [None]:
# Save the cleaned and tokenized text in the dataframe
data['description']=cleaned_text

In [None]:
data_path +'\\cl_stopw_'+os.path.splitext(test_filename)[0]+'.csv'

'C:\\Users\\edumu\\Google Drive\\datasets\\SpainAI NLP\\cl_stopw_test_descriptions.csv'
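
Building the path with a hard-coded backslash works on Windows but not on Colab, where the backslash becomes part of the file name. The sketch below shows the difference using `ntpath` and `posixpath`, the Windows and POSIX flavours behind `os.path`; both folder paths are illustrative values.

```python
import ntpath
import posixpath

# Illustrative folders, one per platform
data_path_win = r'C:\Users\edumu\Google Drive\datasets'
data_path_colab = '/content/drive/My Drive/datasets'
filename = 'cl_stopw_test_descriptions.csv'

# Hard-coded backslash: fine on Windows, wrong on Linux/Colab
hard_coded = data_path_colab + '\\' + filename
print(hard_coded)

# join() inserts the separator of the target platform
win_path = ntpath.join(data_path_win, filename)
colab_path = posixpath.join(data_path_colab, filename)
print(win_path)
print(colab_path)
```

In the notebook itself, `os.path.join(data_path, ...)` picks the right flavour automatically for the platform it runs on.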

In [None]:
data

Unnamed: 0,description
0,knit midi dress vneckline straps matching lace...
1,loosefitting dress round neckline long sleeves...
2,nautical cap peak.this item must returned orig...
3,nautical cap peak . adjustable inner strap det...
4,nautical cap side button detail.this item must...
...,...
1436,striped print cotton cushion cover . cushion f...
1437,rectangular cushion featuring gnome print .
1438,cotton jersey eye mask featuring elastic band ...
1439,padded chipboard hanger featuring iron hook pa...


In [None]:
# Create a file with the cleaned test descriptions
data.to_csv(os.path.join(data_path, 'cl_stopw_'+os.path.splitext(test_filename)[0]+'.csv'), index=False)

### Checking the files created

In [None]:
train_df=pd.read_csv("data/news_summary_train.csv")
print('Train Length: ',len(train_df))
train_df.head(20)

Train Length:  83606


Unnamed: 0,headlines,text
0,Paytm raises $1.4 billion from SoftBank in lar...,Digital payments startup Paytm has raised $1.4...
1,Petrol price cut by ₹1.12 per litre as daily...,Oil companies on Thursday reduced the petrol p...
2,Army plans to deploy women officers for cyber ...,The Indian Army has announced plans to deploy ...
3,Uday Chopra confirms YRF will produce Jessica ...,Yash Raj Films CEO Uday Chopra has confirmed t...
4,Mulayam Yadav to contest 2019 polls from Mainp...,Senior Samajwadi Party leader Ram Gopal Yadav ...
5,I am so excited to play under Virat Kohli's ca...,"Batsman Shubman Gill, who has been included in..."
6,Twitter reacts to women-only screening of Wond...,Reacting to Texas' The Alamo Drafthouse theatr...
7,Apple to hire engineer with psychology backgro...,Apple is hiring a software engineer with psych...
8,Railway Board Chairman resigns after 2 derailm...,"The Chairman of the Railway Board, Ashok Mitta..."
9,Who is billionaire Radhakishan Damani?,Radhakishan Damani is the 61-year-old billiona...


In [None]:
valid_df=pd.read_csv("data/news_summary_valid.csv")
print('Valid Length: ',len(valid_df))
valid_df.head(20)

Valid Length:  14754


Unnamed: 0,headlines,text
0,"Govt forms SIT in Ryan murder case, CBSE seeks...",The HRD Ministry has formed a three-member Spe...
1,"Indrani asks for furniture, jewellery in divor...","In a letter written from jail, Sheena Bora mur..."
2,"ED raids 35 premises of Nirav Modi, ₹550-cr ...",The Enforcement Directorate (ED) on Friday con...
3,Japan admits 1st death from 2011 Fukushima nuc...,Japan has acknowledged for the first time that...
4,An entire village in Germany is being auctione...,An entire village in Germany is being auctione...
5,"Man Utd luckiest PL team, Liverpool unluckiest...",Manchester United were the luckiest team while...
6,Raina overtakes Kohli to become IPL's top run-...,CSK's Suresh Raina overtook RCB captain Virat ...
7,Mumbai housing society saves ₹2 lakh a month...,Residents of a 20-storey housing complex in Ka...
8,NASA launches two Antarctic flights from two c...,For the first time in its nine years of operat...
9,Pentagon slammed for wasting ₹180 cr on Afgh...,US Defence Secretary James Mattis has criticis...


In [None]:
train_df['text'][0]

"Digital payments startup Paytm has raised $1.4 billion from SoftBank in India's largest funding round. This is also SoftBank's biggest investment in the Indian startup ecosystem till date. The latest investment by SoftBank will value Paytm at around $8 billion, up from its valuation of $4.8 billion in August 2016. "