
## Convert wikihowAll.csv to txt files
drive folder: https://drive.google.com/drive/folders/15kt6NJOgzqeAiUJ2ezS2mXUtKY2DKXko?usp=sharing

input:
- wikihowAll.csv

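output:
- train.article.txt: built from the first 193828 source rows (90% of the dataset), before the length filter
- train.summary.txt: as above
- val.article.txt: built from the next 10768 source rows (5% of the dataset), before the length filter
- val.summary.txt: as above
- test.article.txt: built from the last 10769 source rows (5% of the dataset), before the length filter
- test.summary.txt: as above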

(Same format as the raw txt data of the CNN/Daily Mail datasets: each row contains the text of one article or one summary; in a summary, each sentence starts with `<t>` and ends with `</t>`.)
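
To make the summary row format concrete, here is a minimal sketch of how one `headline` value becomes one row of `train.summary.txt`, using the `<t>` and `</t>` indicators written by the cells later in this notebook. The function name `format_summary` and the toy headline are invented for illustration, and the sketch splits the text in memory rather than through a temporary file, so treat it as an approximation of the notebook's behaviour, not a copy of it.

```python
def format_summary(headline: str) -> str:
    """Turn a multi-line headline into one row of <t> ... </t> sentences."""
    # drop the stray comma that follows each sentence-ending period in the csv
    cleaned = headline.replace(".,", ".")
    # keep non-empty lines only, lower-cased (assumes plain "\n" line breaks)
    sentences = [line.lower() for line in cleaned.split("\n") if line.strip()]
    # each sentence is written as "<t> sentence </t> ", including the trailing space
    return "".join(f"<t> {s} </t> " for s in sentences)


# toy headline invented for illustration (not taken from wikihowAll.csv)
row = format_summary("\nFirst step.,\nSecond step.")
print(repr(row))  # '<t> first step. </t> <t> second step. </t> '
```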

In [0]:
import numpy as np
import pandas as pd
import os
import re

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# read data from the csv file (from the location it is stored)
Data = pd.read_csv(r'/content/drive/My Drive/NLP PROJ/wikihow/wikihowAll.csv')
Data = Data.astype(str)
rows, columns = Data.shape

In [5]:
Data

Unnamed: 0,headline,title,text
0,"\nKeep related supplies in the same area.,\nMa...",How to Be an Organized Artist1,"If you're a photographer, keep all the necess..."
1,\nCreate a sketch in the NeoPopRealist manner ...,How to Create a Neopoprealist Art Work,See the image for how this drawing develops s...
2,"\nGet a bachelor’s degree.,\nEnroll in a studi...",How to Be a Visual Effects Artist1,It is possible to become a VFX artist without...
3,\nStart with some experience or interest in ar...,How to Become an Art Investor,The best art investors do their research on t...
4,"\nKeep your reference materials, sketches, art...",How to Be an Organized Artist2,"As you start planning for a project or work, ..."
...,...,...,...
215360,\nConsider changing the spelling of your name....,How to Pick a Stage Name3,"If you have a name that you like, you might f..."
215361,"\nTry out your name.,\nDon’t legally change yo...",How to Pick a Stage Name4,Your name might sound great to you when you s...
215362,"\nUnderstand the process of relief printing.,\...",How to Identify Prints1,Relief printing is the oldest and most tradit...
215363,\nUnderstand the process of intaglio printing....,How to Identify Prints2,"Intaglio is Italian for ""incis­ing,"" and corr..."


#### original process.py for wikihow data set (don't run it for now!)

(code source: https://github.com/mahnazkoupaee/WikiHow-Dataset)

'''
This code is used to create article and summary files from the csv file.
The output of the file will be a directory of text files representing separate articles and their summaries.
Each summary line is preceded by a line containing the tag "@summary" and the article is preceded by a line containing the tag "@article".
'''

input: wikihowAll.csv      
output: titles.txt  +  a folder named "articles" containing 215365 txt files (one for each article)  

Link of drive folder: https://drive.google.com/drive/folders/1_8s_A0OC5153gktx6dSbzLh02QJtI9LS?usp=sharing

%time: the whole notebook took 70 mins on Colab
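
For reference, the per-article layout described above ("@summary" lines followed by "@article") can be read back with a few lines of Python. This is only a sketch: `parse_article_file` is a hypothetical helper, and the sample text is invented to match what the cell below writes.

```python
def parse_article_file(text: str):
    """Split one article file into (list of summary sentences, article text)."""
    summaries, article_lines = [], []
    section = None
    for line in text.splitlines():
        if section == "article":
            # everything after the "@article" tag belongs to the article
            article_lines.append(line)
        elif line == "@article":
            section = "article"
        elif line == "@summary":
            section = "summary"
        elif section == "summary" and line.strip():
            summaries.append(line.strip())
    return summaries, "\n".join(article_lines)


# sample content invented for illustration
sample = "@summary\nfirst step.\n\n@summary\nsecond step.\n@article\nLine one.\nLine two."
summaries, article = parse_article_file(sample)
print(summaries)
print(article)
```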

In [0]:
# create a file to record the file names. This can be later used to divide the dataset in train/dev/test sets
title_file = open('/content/drive/My Drive/NLP PROJ/wikihow/titles.txt', 'wb')

# The path where the articles are to be saved
path = "/content/drive/My Drive/NLP PROJ/wikihow/articles"
if not os.path.exists(path): os.makedirs(path)

# go over all the articles in the data file
for row in range(rows):
    abstract = Data.iloc[row,0]      # headline is the column representing the summary sentences
    article = Data.iloc[row,2]           # text is the column representing the article

    #  a threshold is used to remove short articles with long summaries as well as articles with no summary
    if len(abstract) < (0.75*len(article)):
        # remove extra commas in abstracts
        abstract = abstract.replace(".,",".")
        abstract = abstract.encode('utf-8')
        # remove extra commas in articles
        article = re.sub(r'[.]+[\n]+[,]',".\n", article)
        article = article.encode('utf-8')
        

        # a temporary file is created to initially write the summary, it is later used to separate the sentences of the summary
        with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','wb') as t:
            t.write(abstract)
        
        # file names are created using the alphanumeric characters from the article titles.
        # they are stored in a separate text file.
        filename = Data.iloc[row,1]
        filename = "".join(x for x in filename if x.isalnum())
        filename1 = filename + '.txt'
        filename = filename.encode('utf-8')
        title_file.write(filename+b'\n')

        
        with open(path+'/'+filename1,'wb') as f:
            # summary sentences will first be written into the file in separate lines
            with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','r', encoding='utf-8') as t:
                for line in t:
                    line=line.lower()
                    # skip blank and whitespace-only lines (each line read from the file still ends with "\n")
                    if line.strip():
                        f.write(b'@summary'+b'\n')
                        f.write(line.encode('utf-8'))
                        f.write(b'\n')
                    
            # finally the article is written to the file
            f.write(b'@article' + b'\n')    
            f.write(article)

title_file.close()
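
The length threshold used in the loop above compares character counts, not word or sentence counts. A minimal sketch of the same test as a standalone predicate, where `keep_pair` and its `ratio` parameter are hypothetical names chosen for illustration:

```python
def keep_pair(abstract: str, article: str, ratio: float = 0.75) -> bool:
    """Return True when the summary is shorter than `ratio` times the article, in characters."""
    return len(abstract) < ratio * len(article)


print(keep_pair("a" * 10, "b" * 100))  # True: 10 < 75
print(keep_pair("a" * 80, "b" * 100))  # False: 80 is not < 75
print(keep_pair("", ""))               # False: 0 is not < 0
```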

#### Modified processing code creating txt files to be integrated into the general data pipeline

In [0]:
train_range = np.arange(int(rows*0.9))
val_range = np.arange(int(rows*0.9),int(rows*0.95))
test_range = np.arange(int(rows*0.95),rows)
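
As a quick check of the split arithmetic, the same three expressions can be evaluated for the 215365 rows quoted in this notebook. The row count is hard-coded here only so that the snippet runs without the csv file. These are source-row counts, so the files written below can contain fewer rows once the length threshold is applied.

```python
import numpy as np

rows = 215365  # row count quoted in this notebook for wikihowAll.csv

train_range = np.arange(int(rows * 0.9))
val_range = np.arange(int(rows * 0.9), int(rows * 0.95))
test_range = np.arange(int(rows * 0.95), rows)

print(len(train_range), len(val_range), len(test_range))  # 193828 10768 10769
```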

- train set

In [0]:
################################################################################################################################
# path for article.txt
train_article  = open('/content/drive/My Drive/NLP PROJ/wikihow/train.article.txt', 'wb')

# go over row by row
for row in train_range:            
    abstract = Data.iloc[row,0]     
    article = Data.iloc[row,2]          

    #  a threshold is used to remove short articles with long summaries as well as articles with no summary
    if len(abstract) < (0.75*len(article)):
        # remove extra commas in articles
        article = re.sub(r'[.]+[\n]+[,]',". ", article)
        article = re.sub(r'[\n]'," ", article)
        article = article.encode('utf-8')
        
        # write to the file
        train_article.write(article)
        train_article.write(b'\n')    
            
train_article.close()
################################################################################################################################

################################################################################################################################
# path for summary.txt
train_summary  = open('/content/drive/My Drive/NLP PROJ/wikihow/train.summary.txt', 'wb')

# go over row by row
for row in train_range:            
    abstract = Data.iloc[row,0]      # headline is the column representing the summary sentences
    article = Data.iloc[row,2]           # text is the column representing the article

    #  a threshold is used to remove short articles with long summaries as well as articles with no summary
    if len(abstract) < (0.75*len(article)):
        # remove extra commas in abstracts
        abstract = abstract.replace(".,",".")
        abstract = abstract.encode('utf-8')
        
        # a temporary file is created to initially write the summary, it is later used to separate the sentences of the summary
        with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','wb') as t:
            t.write(abstract)
        
        # summary sentences are converted from separate lines to a single row, with each sentence wrapped in <t> ... </t>
        with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','r', encoding='utf-8') as t:
            for line in t:
                # skip blank and whitespace-only lines (each line read from the file still ends with "\n")
                if line.strip():
                    line = line.lower()
                    line = re.sub(r'[\n]',"", line)

                    train_summary.write(b'<t> ')
                    train_summary.write(line.encode('utf-8'))
                    train_summary.write(b' </t> ')
            train_summary.write(b'\n')
            
train_summary.close()
################################################################################################################################
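
The train, validation and test cells differ only in the row range and the output paths, so the same logic could be folded into one helper. The sketch below is one possible shape, not the notebook's code: `write_split` and its parameters are hypothetical names, the summary is split in memory instead of through `temporaryFile.txt` (assuming headlines use plain `\n` line breaks), and the toy rows are invented for illustration.

```python
import os
import re
import tempfile


def write_split(pairs, article_path, summary_path, ratio=0.75):
    """Write one article row and one summary row per kept (headline, text) pair."""
    kept = 0
    with open(article_path, "w", encoding="utf-8", newline="\n") as fa, \
         open(summary_path, "w", encoding="utf-8", newline="\n") as fs:
        for abstract, article in pairs:
            # same length threshold as the cells in this notebook
            if not len(abstract) < ratio * len(article):
                continue
            # article: same two substitutions as the article loop
            article = re.sub(r'[.]+[\n]+[,]', ". ", article)
            article = re.sub(r'[\n]', " ", article)
            fa.write(article + "\n")
            # summary: one "<t> sentence </t> " chunk per non-empty line
            abstract = abstract.replace(".,", ".")
            sentences = [s.lower() for s in abstract.split("\n") if s.strip()]
            fs.write("".join(f"<t> {s} </t> " for s in sentences) + "\n")
            kept += 1
    return kept


# toy rows invented for illustration (not taken from wikihowAll.csv)
toy_pairs = [
    ("\nFirst step.,\nSecond step.",
     "Do the first thing.\n, Then do the second thing slowly and carefully."),
    ("a summary longer than its article", "short"),  # dropped by the threshold
]
with tempfile.TemporaryDirectory() as tmp:
    article_path = os.path.join(tmp, "toy.article.txt")
    summary_path = os.path.join(tmp, "toy.summary.txt")
    n_kept = write_split(toy_pairs, article_path, summary_path)
    with open(article_path, encoding="utf-8") as f:
        article_rows = f.read().splitlines()
    with open(summary_path, encoding="utf-8") as f:
        summary_rows = f.read().splitlines()

print(n_kept, len(article_rows), len(summary_rows))
```

With the notebook's data, a call could look like `write_split(zip(Data.iloc[train_range, 0], Data.iloc[train_range, 2]), ...)`, which writes both files in a single pass over the rows.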

- validation set

In [0]:
################################################################################################################################
# path for article.txt
val_article  = open('/content/drive/My Drive/NLP PROJ/wikihow/val.article.txt', 'wb')

# go over row by row
for row in val_range:            
    abstract = Data.iloc[row,0]     
    article = Data.iloc[row,2]          

    #  a threshold is used to remove short articles with long summaries as well as articles with no summary
    if len(abstract) < (0.75*len(article)):
        # remove extra commas in articles
        article = re.sub(r'[.]+[\n]+[,]',". ", article)
        article = re.sub(r'[\n]'," ", article)
        article = article.encode('utf-8')
        
        # write to the file
        val_article.write(article)
        val_article.write(b'\n')    
            
val_article.close()
################################################################################################################################

################################################################################################################################
# path for summary.txt
val_summary  = open('/content/drive/My Drive/NLP PROJ/wikihow/val.summary.txt', 'wb')

# go over row by row
for row in val_range:            
    abstract = Data.iloc[row,0]      # headline is the column representing the summary sentences
    article = Data.iloc[row,2]           # text is the column representing the article

    #  a threshold is used to remove short articles with long summaries as well as articles with no summary
    if len(abstract) < (0.75*len(article)):
        # remove extra commas in abstracts
        abstract = abstract.replace(".,",".")
        abstract = abstract.encode('utf-8')
        
        # a temporary file is created to initially write the summary, it is later used to separate the sentences of the summary
        with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','wb') as t:
            t.write(abstract)
        
        # summary sentences are converted from separate lines to a single row, with each sentence wrapped in <t> ... </t>
        with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','r', encoding='utf-8') as t:
            for line in t:
                # skip blank and whitespace-only lines (each line read from the file still ends with "\n")
                if line.strip():
                    line = line.lower()
                    line = re.sub(r'[\n]',"", line)

                    val_summary.write(b'<t> ')
                    val_summary.write(line.encode('utf-8'))
                    val_summary.write(b' </t> ')
            val_summary.write(b'\n')
            
val_summary.close()
################################################################################################################################

- test set

In [0]:
################################################################################################################################
# path for article.txt
test_article  = open('/content/drive/My Drive/NLP PROJ/wikihow/test.article.txt', 'wb')

# go over row by row
for row in test_range:            
    abstract = Data.iloc[row,0]     
    article = Data.iloc[row,2]          

    #  a threshold is used to remove short articles with long summaries as well as articles with no summary
    if len(abstract) < (0.75*len(article)):
        # remove extra commas in articles
        article = re.sub(r'[.]+[\n]+[,]',". ", article)
        article = re.sub(r'[\n]'," ", article)
        article = article.encode('utf-8')
        
        # write to the file
        test_article.write(article)
        test_article.write(b'\n')    
            
test_article.close()
################################################################################################################################

################################################################################################################################
# path for summary.txt
test_summary  = open('/content/drive/My Drive/NLP PROJ/wikihow/test.summary.txt', 'wb')

# go over row by row
for row in test_range:            
    abstract = Data.iloc[row,0]      # headline is the column representing the summary sentences
    article = Data.iloc[row,2]           # text is the column representing the article

    #  a threshold is used to remove short articles with long summaries as well as articles with no summary
    if len(abstract) < (0.75*len(article)):
        # remove extra commas in abstracts
        abstract = abstract.replace(".,",".")
        abstract = abstract.encode('utf-8')
        
        # a temporary file is created to initially write the summary, it is later used to separate the sentences of the summary
        with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','wb') as t:
            t.write(abstract)
        
        # summary sentences are converted from separate lines to a single row, with each sentence wrapped in <t> ... </t>
        with open('/content/drive/My Drive/NLP PROJ/wikihow/temporaryFile.txt','r', encoding='utf-8') as t:
            for line in t:
                # skip blank and whitespace-only lines (each line read from the file still ends with "\n")
                if line.strip():
                    line = line.lower()
                    line = re.sub(r'[\n]',"", line)

                    test_summary.write(b'<t> ')
                    test_summary.write(line.encode('utf-8'))
                    test_summary.write(b' </t> ')
            test_summary.write(b'\n')
            
test_summary.close()
################################################################################################################################
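
Because the article loop and the summary loop apply the same threshold to the same rows, each `*.article.txt` should end up with exactly as many lines as its `*.summary.txt`. A minimal sketch of a check for that, where `count_lines` and `check_aligned` are hypothetical helpers and the toy files are invented so the snippet runs without Drive access:

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating the file in binary mode."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_aligned(article_path, summary_path):
    """Return the shared row count, or raise if the two files differ in length."""
    n_articles = count_lines(article_path)
    n_summaries = count_lines(summary_path)
    if n_articles != n_summaries:
        raise ValueError(
            f"{article_path}: {n_articles} rows, {summary_path}: {n_summaries} rows"
        )
    return n_articles


# toy files invented for illustration; with the real data the two paths
# would point at e.g. train.article.txt and train.summary.txt
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "toy.article.txt")
    s = os.path.join(tmp, "toy.summary.txt")
    with open(a, "wb") as f:
        f.write(b"article one\narticle two\n")
    with open(s, "wb") as f:
        f.write(b"<t> summary one </t> \n<t> summary two </t> \n")
    n_rows = check_aligned(a, s)

print(n_rows)  # 2
```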