# Preprocessing 

All of the steps below are optional. Finding the right way to preprocess your text data is a matter of trial and error. Additional, more targeted preprocessing may be necessary for each subset if you notice abnormalities that might affect your clustering (ID numbers, specific formatting, etc.). My advice is to use the Description and any other text data available to you whenever it makes sense to do so. More text is better, but only if it provides new and relevant information; don’t inject noise for the sake of adding more text data.
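As an illustration of targeted preprocessing, here is a minimal sketch that strips ID-like tokens from a piece of text. The pattern (an uppercase key, a hyphen, then digits, such as `REQ-1042`) and the `strip_ids` helper are assumptions made up for this example, not part of `utils`; adjust the pattern to whatever identifiers actually appear in your data.

```python
import re

# Hypothetical pattern: an uppercase project key, a hyphen, then digits (e.g. "REQ-1042").
# Adjust it to match the identifiers that actually occur in your data.
ID_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def strip_ids(text: str) -> str:
    """Remove ID-like tokens and collapse the leftover whitespace."""
    without_ids = ID_PATTERN.sub(" ", text)
    return " ".join(without_ids.split())


print(strip_ids("See REQ-1042 and SYS2-7 for the login requirement."))
# See and for the login requirement.
```

Replacing each match with a space before collapsing whitespace keeps neighboring words from being glued together.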

- Removing stop words, e.g. “the”, “and”, etc.
  - You can also create your own list of common words to remove if you find that your clusters are being skewed by words that aren’t really relevant (this will show up when you do keyword extraction on your clusters). For example, you might get clusters of tickets that all share vague words like “requirements” or “user.” While the language in those clusters is similar, grouping on it doesn’t accomplish the task we’re setting out to do, so consider removing these kinds of words from the text. You may also consider removing all Components and system names from the text.
- Removing punctuation and/or digits
- Lower casing
- Lemmatizing
  - This converts all words to their base form. For example, it would change “running” to “run” and “better” to “good.” Unlike stemming, which only trims suffixes, lemmatizing maps each word to a dictionary form, which makes it more robust.
  - If you’re using Doc2Vec or another algorithm that takes context into consideration, you may want to skip this step. Trial and error will help you make that decision.
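To make the steps above concrete, here is a minimal, standard-library-only sketch of lower casing, removing punctuation and digits, and dropping stop words. It is not the implementation of `utils.preprocess_text`; the `simple_preprocess` function and both word lists are hypothetical and exist only for illustration. Lemmatizing is left out because it needs an NLP library (for example NLTK's `WordNetLemmatizer`).

```python
import string

# Tiny illustrative stop list; a real one (e.g. from NLTK or spaCy) is much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is", "in"}
# Hypothetical domain-specific words found to dominate uninformative clusters.
CUSTOM_STOP_WORDS = {"requirements", "user"}


def simple_preprocess(text, stop_words=frozenset(), remove_digits=True):
    """Lowercase, strip punctuation (and optionally digits), then drop stop words."""
    text = text.lower()
    chars_to_remove = string.punctuation + (string.digits if remove_digits else "")
    # Replace removed characters with spaces so "run/walk" does not become "runwalk".
    text = text.translate(str.maketrans(chars_to_remove, " " * len(chars_to_remove)))
    return [tok for tok in text.split() if tok not in stop_words]


tokens = simple_preprocess(
    "The user shall reset the password in 3 steps.",
    stop_words=STOP_WORDS | CUSTOM_STOP_WORDS,
)
print(tokens)  # ['shall', 'reset', 'password', 'steps']
```

Passing an empty stop-word collection and skipping lemmatization corresponds to the minimal preprocessing used for Doc2Vec in the cells that follow.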

In [None]:
# import packages
import pandas as pd
import utils
import pickle

# Import Processed Requirements Data from saved CSV

In [None]:
df = pd.read_csv('dummy_requirements.csv')

# Preprocess Text for NLP Evaluation
## Minimal preprocessing for the Doc2Vec vectorization step

In [None]:
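# Cast every column to str (note: missing values become the string 'nan')
df = df.astype(str)
my_stop_words = []  # minimal preprocessing: do not remove stop words or lemmatize for Doc2Vec

# Only the Description column is needed, so apply to that column instead of row-wise
df['Requirement'] = df['Description'].apply(
    lambda text: utils.preprocess_text(text, my_stop_words, keep_original=True,
                                       display=False, pos_tags=True))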
df

In [None]:
# With pos_tags=True, each 'Requirement' entry holds two elements:
# index 0 is the processed requirement and index 1 is its POS tags.
# Extract POS first, because the second line overwrites 'Requirement'.
df['POS'] = df['Requirement'].str[1]
df['Requirement'] = df['Requirement'].str[0]

In [None]:
df

In [None]:
with open("cleaned_dummy_doc2vec.pickle", "wb") as pickle_file:
    pickle.dump(df, pickle_file, protocol=pickle.HIGHEST_PROTOCOL)

# View list of Components

In [None]:
# Number of Requirements by Component/Program of Record
components = df.groupby(by='Component/s')
components.size()

# Subgroup by component

In [None]:
groups = components.groups.keys()
grouped_data = {}
for g in groups:
    print(g)
    # Store each component's requirements as an all-string DataFrame
    sub_df = components.get_group(g).astype(str)
    grouped_data[g] = sub_df
    print(sub_df)

In [None]:
grouped_data

In [None]:
with open("DummyPreproccessedForDoc2Vec.pickle", "wb") as pickle_file:
    pickle.dump(grouped_data, pickle_file, protocol=pickle.HIGHEST_PROTOCOL)
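
A later notebook will need to read this file back. The sketch below shows one way to do that; the `load_pickle` helper is a hypothetical name introduced here, and unpickling the DataFrames assumes pandas is installed in the loading environment. Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load and return whatever object was pickled at `path`."""
    with open(path, "rb") as pickle_file:
        return pickle.load(pickle_file)


# Guarded so the snippet is harmless if the file has not been written yet.
pickle_path = Path("DummyPreproccessedForDoc2Vec.pickle")
if pickle_path.exists():
    grouped_data = load_pickle(pickle_path)
    # Each key is a component name; each value is that component's DataFrame.
    for component, sub_df in grouped_data.items():
        print(component, len(sub_df))
```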