### Description:
In this notebook, I load all the data and apply some preprocessing. <br>

In **Stage 1: Import data**:<br>
I import all 6 text files into a single dataframe **docs_df** with two columns, <br>
where column **Content** holds the lines of text and column **FileName** holds the source file name.<br>

In **Stage 2: Tokenize files into sentences**:<br>
I split each line in column **Content** into sentences and <br>
store them in a new dataframe **docs_df_new** with two columns,<br>
where column **Sentence** holds one tokenized sentence per row<br>
and column **FileName** holds the source file name of each sentence.<br>

In **Stage 3: Process sentences**:<br>
I process the sentences in column **Sentence** of **docs_df_new** in two different ways.<br>
The first lowercases each sentence and removes stopwords and punctuation; <br>
the results are stored in the new column **Clean sentence**.<br>
The second lowercases each sentence, removes stopwords and punctuation, and replaces each word with its lemma; <br>
the results are stored in the new column **Lemmatized**.<br>

In [1]:
import pandas as pd
import glob
import spacy

### Stage 1: Import data

In [2]:
# 1
# Get file names
import os

main_dir = 'test docs/'
file_paths = glob.glob(main_dir + "*.txt")

f_names = [os.path.basename(fp) for fp in file_paths]  # keep only the file name, drop the directory part
f_names = sorted(f_names)
print(f_names)

['doc1.txt', 'doc2.txt', 'doc3.txt', 'doc4.txt', 'doc5.txt', 'doc6.txt']


In [3]:
# 2.1
# Import data from files
dfs = []
for f in f_names:
    df = pd.read_csv(main_dir+f, sep="\t", names=["Content"])      # Import file (csv, txt, ...)
    df['FileName'] = f
    dfs.append(df)
    
# 2.2
# Combine data into single frame
docs_df = pd.concat(dfs, axis=0, ignore_index=True)

# 2.3
# Swap columns
docs_df = docs_df.reindex(columns=["FileName", "Content"])
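
One caveat with `pd.read_csv(..., sep="\t")` on free text: the parser treats `"` as a quote character by default, so a line that opens with a quotation mark may lose its quotes or be merged with the lines that follow, and a line containing a tab is no longer read as a single value. A minimal sketch of an alternative that reads the files as plain text is given here. The helper name `read_lines_to_frame`, the UTF-8 encoding, the demo file contents, and the choice to skip blank lines are assumptions for illustration, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_lines_to_frame(paths, encoding="utf-8"):
    """Return a frame with one row per non-empty line of each file (hypothetical helper)."""
    records = []
    for path in sorted(Path(p) for p in paths):
        with open(path, encoding=encoding) as handle:
            for line in handle:
                text = line.rstrip("\n")
                if text.strip():  # skip blank lines
                    records.append({"FileName": path.name, "Content": text})
    return pd.DataFrame(records, columns=["FileName", "Content"])


# Small demo on throwaway files (the contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "doc1.txt").write_text('"Quoted" opening line.\n\nSecond line.\n', encoding="utf-8")
    Path(tmp, "doc2.txt").write_text("Only line.\n", encoding="utf-8")
    demo_df = read_lines_to_frame(Path(tmp).glob("*.txt"))

print(demo_df)
```

In the notebook this would be called as `read_lines_to_frame(glob.glob(main_dir + "*.txt"))`, which also makes the separate column swap unnecessary.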

In [4]:
print("DF shape:", docs_df.shape)
docs_df.head()

DF shape: (366, 2)


| | FileName | Content |
|---|---|---|
| 0 | doc1.txt | Let me begin by saying thanks to all you who'v... |
| 1 | doc1.txt | We all made this journey for a reason. It's hu... |
| 2 | doc1.txt | That's the journey we're on today. But let me ... |
| 3 | doc1.txt | My work took me to some of Chicago's poorest n... |
| 4 | doc1.txt | It was in these neighborhoods that I received ... |


### Stage 2: Tokenize files into sentences

In [5]:
# Load English model
nlp = spacy.load('en_core_web_sm')

# Add the component to the pipeline
nlp.add_pipe('sentencizer', before="parser")

<spacy.pipeline.sentencizer.Sentencizer at 0x11e074e40>

In [6]:
# Create empty dataframe
columns = ['FileName', 'Sentence']
docs_df_new = pd.DataFrame(columns=columns)

In [7]:
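# Add sentences to dataframe
# Collect rows in a list first: DataFrame.append was removed in pandas 2.0
# and copied the whole frame on every call.
rows = []
for _, row in docs_df.iterrows():
    sentences = nlp(row["Content"]).sents

    for s in sentences:
        rows.append({'FileName': row['FileName'], 'Sentence': s.text})

docs_df_new = pd.DataFrame(rows, columns=columns)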
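
The sentence-splitting step can also be sketched with `nlp.pipe`, which processes the texts as a stream instead of calling the model once per row. The function name `split_into_sentences` and the demo data are hypothetical, and the demo uses `spacy.blank("en")` with only a `sentencizer` so that it runs without downloading `en_core_web_sm`; this is an assumption for illustration, and the notebook's own `nlp` object could be passed in instead.

```python
import pandas as pd
import spacy


def split_into_sentences(frame, nlp):
    """Return one row per sentence, keeping the source file name (hypothetical helper)."""
    rows = []
    docs = nlp.pipe(frame["Content"].astype(str))  # stream the texts through the pipeline
    for file_name, doc in zip(frame["FileName"], docs):
        for sent in doc.sents:
            rows.append({"FileName": file_name, "Sentence": sent.text})
    return pd.DataFrame(rows, columns=["FileName", "Sentence"])


# Demo pipeline: tokenizer plus rule-based sentence splitter only (assumption)
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

demo_df = pd.DataFrame({
    "FileName": ["a.txt", "b.txt"],
    "Content": ["One sentence. Another one.", "A third."],
})
sentence_df = split_into_sentences(demo_df, nlp_demo)
print(sentence_df)
```

Building the frame once from a list of rows avoids growing a dataframe inside the loop.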


In [8]:
print("DF shape:", docs_df_new.shape)
docs_df_new.head()

DF shape: (941, 2)


| | FileName | Sentence |
|---|---|---|
| 0 | doc1.txt | Let me begin by saying thanks to all you who'v... |
| 1 | doc1.txt | We all made this journey for a reason. |
| 2 | doc1.txt | It's humbling, but in my heart I know you didn... |
| 3 | doc1.txt | In the face of war, you believe there can be p... |
| 4 | doc1.txt | In the face of despair, you believe there can ... |


### Stage 3: Process sentences
Remove stopwords <br>
Take word lemmas

In [9]:
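# Helper functions
def remove_stopwords(sent_):
    """Lowercase a sentence and drop stopwords and punctuation."""
    # 4.2
    # Apply model to the sentence
    docs_sent = nlp(sent_)

    # 4.3
    # Get stopwords (the entries in this set are lowercase)
    stopwords = spacy.lang.en.stop_words.STOP_WORDS

    # 4.4
    # Keep the lowercased original words.
    # Note: a token is dropped only if its lemma is in the stopword set, and the
    # check is case-sensitive, so tokens whose lemma is "I" are kept.
    proc_tokens = []
    for token in docs_sent:
        if token.lemma_ not in stopwords and not token.is_punct:
            proc_tokens.append(token.text.lower())

    return " ".join(proc_tokens)

def remove_stopwords_and_lemmatize(sent_):
    """Drop stopwords and punctuation and replace each word with its lowercased lemma."""
    # 4.2
    # Apply model to the sentence
    docs_sent = nlp(sent_)

    # 4.3
    # Get stopwords (the entries in this set are lowercase)
    stopwords = spacy.lang.en.stop_words.STOP_WORDS

    # 4.4
    # Keep the lowercased lemmas (same case-sensitive stopword check as above)
    proc_tokens = []
    for token in docs_sent:
        if token.lemma_ not in stopwords and not token.is_punct:
            proc_tokens.append(token.lemma_.lower())

    return " ".join(proc_tokens)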
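
Because the stopword set is lowercase and the helpers test `token.lemma_` against it directly, a capitalised lemma such as "I" passes the check, which is why "i" and "me" still appear in the **Clean sentence** output. A minimal sketch of a case-insensitive variant based on spaCy's `token.is_stop` attribute follows. The function name `strip_stopwords` is hypothetical, and the demo uses `spacy.blank("en")` (tokenizer only) as an assumption so it runs without the trained model. Note that `is_stop` looks at the token text rather than its lemma, so this variant would not reproduce the outputs shown in this notebook exactly.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# The stopword set holds lowercase entries, so a plain membership test is case-sensitive
print("I" in STOP_WORDS, "i" in STOP_WORDS)


def strip_stopwords(doc):
    """Lowercase the tokens of a spaCy Doc, dropping stopwords and punctuation (hypothetical helper)."""
    return " ".join(t.text.lower() for t in doc if not t.is_stop and not t.is_punct)


nlp_demo = spacy.blank("en")  # tokenizer only; enough for is_stop and is_punct
cleaned = strip_stopwords(nlp_demo("We all made this journey for a reason."))
print(cleaned)
```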

In [10]:
# Apply helper functions
docs_df_new["Clean sentence"] = docs_df_new["Sentence"].apply(remove_stopwords)
docs_df_new["Lemmatized"] = docs_df_new["Sentence"].apply(remove_stopwords_and_lemmatize)
docs_df_new.shape

(941, 4)

In [11]:
docs_df_new.head()

| | FileName | Sentence | Clean sentence | Lemmatized |
|---|---|---|---|---|
| 0 | doc1.txt | Let me begin by saying thanks to all you who'v... | let me begin thanks traveled far wide brave co... | let I begin thank travel far wide brave cold t... |
| 1 | doc1.txt | We all made this journey for a reason. | journey reason | journey reason |
| 2 | doc1.txt | It's humbling, but in my heart I know you didn... | humbling heart i know come me came believe cou... | humble heart I know come I come believe country |
| 3 | doc1.txt | In the face of war, you believe there can be p... | face war believe peace | face war believe peace |
| 4 | doc1.txt | In the face of despair, you believe there can ... | face despair believe hope | face despair believe hope |


In [12]:
# Export file
docs_df_new.to_csv("Processed_data.csv", index=False)
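
When the exported file is read back, any sentence that consisted only of stopwords and punctuation (an empty string in **Clean sentence** or **Lemmatized**) comes back as `NaN`, because `pd.read_csv` treats empty fields as missing by default. A minimal sketch of the round trip follows, using made-up rows and an in-memory buffer instead of `Processed_data.csv`; the variable names are hypothetical.

```python
import io

import pandas as pd

# Made-up rows: the first sentence is assumed to clean down to an empty string
demo_df = pd.DataFrame({
    "Sentence": ["We all did.", "A journey."],
    "Clean sentence": ["", "journey"],
})

buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)

# Default read: the empty field becomes NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)

print(default_read["Clean sentence"].isna().tolist())
print(kept_empty["Clean sentence"].tolist())
```

Either `keep_default_na=False` or a later `fillna("")` keeps downstream string operations from failing on missing values.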