# Feature Engineering on Text Data

In this notebook, we compute features on the data streamed from seppe.net in Preprocessing.ipynb. The following features are computed on columns of the extracted wiki_df dataframe:

- TF-IDF: The Term Frequency - Inverse Document Frequency matrix weights each word's count in a document by how rare that word is across the entire document corpus, so words that appear in almost every document contribute little. We use this on the raw edits applied to each Wikipedia article to learn which words and terms in an edit are associated with vandal edits.
- LDA: Latent Dirichlet Allocation is a technique used in automated topic discovery. We use this on the overall Wiki text before the edit to discover the original topic of the piece. The reason for using this feature is that some topics, such as political articles, may be more susceptible to vandalism than others.
- Levenshtein Distance: The minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. We use this on the raw edits to quantify the size of the edit. Large edits may correspond to large erasures or rewrites of an article's text, which can indicate vandalism or censoring of information from the public.
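
To make the edit-size feature concrete, here is a minimal pure-Python sketch of the Levenshtein distance. The function name `levenshtein` and the example strings are illustrative assumptions, not code from Preprocessing.ipynb.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

On Spark dataframes, `pyspark.sql.functions.levenshtein(left, right)` computes the same quantity between two string columns, which is likely the more practical choice at this data size.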

In [1]:
# Importing the feature transformation classes for doing TF-IDF 
from pyspark.ml.feature import HashingTF, Tokenizer, StopWordsRemover, CountVectorizer, IDF


In [2]:
%run "preprocessing.ipynb"

In [20]:
wiki_df = get_wiki_df()

get_label_count(wiki_df)

+------+--------+
| label|count(1)|
+------+--------+
|  safe|   16567|
|unsafe|    2291|
|vandal|     154|
+------+--------+



In [3]:
wiki_df.show(2)


+--------------------+-----+------------------+--------------------+--------------------+--------------------+--------------------+
|             comment|label|         name_user|            text_new|            text_old|          title_page|            url_page|
+--------------------+-----+------------------+--------------------+--------------------+--------------------+--------------------+
|→‎4 February:adde...| safe|SebastianRueckoldt|{{see also|Timeli...|{{see also|Timeli...|Timeline of the 2...|//en.wikipedia.or...|
|removing duplicat...| safe|Andreas Philopater|{{short descripti...|{{short descripti...|List of Art Deco ...|//en.wikipedia.or...|
+--------------------+-----+------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [2]:
# Get clean dataframe (cleaning of comment, title_page, name_user):
clean_df = get_clean_df(wiki_df)

# In order to get the actual difference column
df_with_difference = get_difference_column(clean_df)

# The difference column has the form {removed: [...], added: [...]},
# so we know which words were added and which were removed.
# Print the dataframe shape (rows, columns) and the column names:
print((df_with_difference.count(), len(df_with_difference.columns)))
print(df_with_difference.columns)

(19012, 7)
['label', 'comment', 'title_page', 'name_user', 'text_old', 'text_new', 'difference']
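
The `get_difference_column` helper is defined in the preprocessing notebook and is not shown here. As a rough illustration of the `{removed: [...], added: [...]}` shape only, the following bag-of-words sketch compares two texts; the function name `word_difference` and the whitespace tokenization are assumptions and may differ from the actual implementation.

```python
from collections import Counter


def word_difference(text_old: str, text_new: str) -> dict:
    """Return the words removed from and added to a text, ignoring word order."""
    old_counts = Counter(text_old.split())
    new_counts = Counter(text_new.split())
    # Counter subtraction keeps only positive counts
    removed = sorted((old_counts - new_counts).elements())
    added = sorted((new_counts - old_counts).elements())
    return {"removed": removed, "added": added}


diff = word_difference("the quick brown fox", "the slow brown fox jumps")
print(diff)  # {'removed': ['quick'], 'added': ['jumps', 'slow']}
```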


In [4]:
def tfIdf(df, count_method='hash'):
    """Convert the text data into a Term Frequency-Inverse Document Frequency (TF-IDF) vector.

    parameters:
        df: dataframe with a 'text_new' column holding the text to vectorize
        count_method: Default = 'hash'. Determines whether to use feature hashing ('hash')
            or vocabulary counts (any other value) as the TF step for TF-IDF
    returns: dataframe with tf-idf vectors in the 'tf_idf_features' column

    """

    # Carrying out the Tokenization of the text documents (splitting into words)
    tokenizer = Tokenizer(inputCol="text_new", outputCol="tokenised_text")
    tokensDf = tokenizer.transform(df)
    # Carrying out the StopWords Removal for TF-IDF
    stopwordsremover = StopWordsRemover(inputCol="tokenised_text", outputCol="words")
    swremovedDf = stopwordsremover.transform(tokensDf)

    if count_method == 'hash':
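        # Hashing is irreversible (an index cannot be mapped back to its word),
        # whereas CountVectorizer keeps a vocabulary, so its counts can be traced back to words.
        # Applying HashingTF needs only a single pass over the data, while applying IDF needs two passes:
        # the first to compute the IDF vector and the second to scale the term frequencies by IDF.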
        hashingTF = HashingTF(inputCol="words", outputCol="tf_features")
        tfDf = hashingTF.transform(swremovedDf)
    else:
        # Creating a term-frequency vector for each document, over a vocabulary of
        # at most 300 words that each appear in at least 2 documents
        cv = CountVectorizer(inputCol="words", outputCol="tf_features", vocabSize=300, minDF=2.0)
        cvModel = cv.fit(swremovedDf)
        tfDf = cvModel.transform(swremovedDf)

    # Carrying out Inverse Document Frequency on the TF data.
    # Spark's IDF implementation provides an option for ignoring terms
    # which occur in fewer than a minimum number of documents.
    # In such cases, the IDF for these terms is set to 0.
    # This option can be used by passing the minDocFreq value to the IDF constructor.
    idf = IDF(inputCol="tf_features", outputCol="tf_idf_features")
    idfModel = idf.fit(tfDf)
    tfidfDf = idfModel.transform(tfDf)

    # Cache the result and trigger an action so the TF-IDF dataframe is materialized once
    tfidfDf.cache().count()

    return tfidfDf
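
To show what the IDF step does to the term counts, here is a small pure-Python sketch on a toy corpus. It assumes the smoothed formula documented for Spark's IDF, `log((N + 1) / (df + 1))`, where `N` is the number of documents and `df` the number of documents containing the term. The helper `tf_idf`, its `min_doc_freq` argument, and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized
docs = [
    ["wiki", "edit", "edit"],
    ["wiki", "vandal"],
    ["wiki", "page"],
]


def tf_idf(docs, min_doc_freq=0):
    """Return one {term: tf * idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Terms seen in fewer than min_doc_freq documents get an IDF of 0
    idf = {
        term: math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus, "wiki" occurs in every document, so its weight is `log(4/4) = 0`, while "edit" occurs twice in one document only and gets `2 * log(4/2)`. Calling `tf_idf(docs, min_doc_freq=2)` zeroes out "edit" as well, which mirrors the `minDocFreq` option described in the comments above.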


In [10]:
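from pyspark.sql.functions import lit

# Build a uniform fraction (0.8) for every label, just to inspect the available strata
fractions = wiki_df.select("label").distinct().withColumn("fraction", lit(0.8)).rdd.collectAsMap()
print(fractions)

# Override the fractions: downsample the majority 'safe' class to about 10%
# and keep every 'unsafe' and 'vandal' row
fractions = {'safe': 0.1, 'unsafe': 1.0, 'vandal': 1.0}

seed = 42
sampled_df = wiki_df.stat.sampleBy("label", fractions, seed)
sampled_df.show()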

{'safe': 0.8, 'unsafe': 0.8, 'vandal': 0.8}
+--------------------+------+--------------------+--------------------+--------------------+--------------------+--------------------+
|             comment| label|           name_user|            text_new|            text_old|          title_page|            url_page|
+--------------------+------+--------------------+--------------------+--------------------+--------------------+--------------------+
|→‎General strike ...|  safe|  Twofingered Typist|{{short descripti...|{{short descripti...|2019–20 Hong Kong...|//en.wikipedia.or...|
|→‎International r...|  safe|             CRau080|{{short descripti...|{{short descripti...|2019–20 Hong Kong...|//en.wikipedia.or...|
|                    |  safe|           Promiseus|{{short descripti...|{{short descripti...|2019–20 Hong Kong...|//en.wikipedia.or...|
|       →‎December:gr|  safe|              TVSGuy|{{USTV year|2019}...|{{USTV year|2019}...|2019 in American ...|//en.wikipedia.or...|
|→‎Italy:te

In [11]:
get_label_count(sampled_df)

+------+--------+
| label|count(1)|
+------+--------+
|  safe|    1680|
|unsafe|    2291|
|vandal|     154|
+------+--------+
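
The counts above follow from how stratified sampling with per-label fractions behaves: each row is kept independently with the probability assigned to its label, so the sampled size of a stratum is close to, but not exactly, `fraction * count`. The sketch below reproduces the expected counts from the label table earlier in the notebook and simulates the sampling in plain Python. The `sample_by` helper is a hypothetical stand-in for illustration, not Spark's implementation, and its random draws will not match Spark's for the same seed.

```python
import random

# Label counts reported by get_label_count(wiki_df) earlier in the notebook
label_counts = {"safe": 16567, "unsafe": 2291, "vandal": 154}
fractions = {"safe": 0.1, "unsafe": 1.0, "vandal": 1.0}

# Expected stratum sizes after sampling
expected = {label: n * fractions[label] for label, n in label_counts.items()}
print({label: round(size) for label, size in expected.items()})


def sample_by(rows, key, fractions, seed):
    """Keep each row with the probability given for its stratum.
    Strata missing from `fractions` are treated as fraction 0."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key(row), 0.0)]


# Simulate on a list of labels standing in for the dataframe rows
rows = [label for label, n in label_counts.items() for _ in range(n)]
sampled = sample_by(rows, key=lambda label: label, fractions=fractions, seed=42)
```

The expected size of the 'safe' stratum is about 1657 rows, which is consistent with the 1680 rows Spark returned, while the strata sampled at fraction 1.0 keep all of their rows.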



## Calculate TF-IDF via Spark

In [12]:
tfidfDf=tfIdf(sampled_df)

In [17]:
tfidfDf.select("tf_idf_features").show()

+--------------------+
|     tf_idf_features|
+--------------------+
|(262144,[14,115,1...|
|(262144,[14,115,1...|
|(262144,[14,115,1...|
|(262144,[4,15,29,...|
|(262144,[8,14,60,...|
|(262144,[8,14,60,...|
|(262144,[11,120,2...|
|(262144,[14,90,13...|
|(262144,[13,14,15...|
|(262144,[3,15,62,...|
|(262144,[14,15,83...|
|(262144,[14,20,30...|
|(262144,[112,211,...|
|(262144,[11,15,22...|
|(262144,[14,67,97...|
|(262144,[112,211,...|
|(262144,[15,24,13...|
|(262144,[36,41,21...|
|(262144,[14,127,1...|
|(262144,[14,115,1...|
+--------------------+
only showing top 20 rows
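
Each row above is a sparse vector printed as `(size, [indices], [values])`. The size 262144 is `2**18`, the default `numFeatures` of `HashingTF`: every token is hashed to one of that many buckets and the bucket counts become the term frequencies. The sketch below illustrates the idea in plain Python. It uses `zlib.crc32` as a stand-in hash (Spark uses MurmurHash3), so the indices it produces will not match Spark's, and the helper name `hashing_tf` is an assumption.

```python
import zlib
from collections import Counter

NUM_FEATURES = 1 << 18  # 262144, the default numFeatures of HashingTF


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to bucket indices and count occurrences per bucket.
    Returns (size, sorted indices, counts), mirroring a sparse vector."""
    counts = Counter(
        zlib.crc32(token.encode("utf-8")) % num_features for token in tokens
    )
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return num_features, indices, values


size, indices, values = hashing_tf(["wiki", "edit", "edit", "page"])
print(size, indices, values)
```

Because only the bucket index is stored, two different words can collide in the same bucket, and an index cannot be turned back into the word that produced it. This is the trade-off against `CountVectorizer` noted in the `tfIdf` function.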



In [18]:
tfidfDf.select("text_new").show()

+--------------------+
|            text_new|
+--------------------+
|{{short descripti...|
|{{short descripti...|
|{{short descripti...|
|{{USTV year|2019}...|
|{{pp-protected|sm...|
|{{pp-protected|sm...|
|{{short descripti...|
|{{short descripti...|
|{{about|the 2013 ...|
|{{pp-protected|sm...|
|{{pp-move-indef}}...|
|{{Use mdy dates|d...|
|{{short descripti...|
|{{short descripti...|
|{{short descripti...|
|{{short descripti...|
|{{For|related rac...|
|{{short descripti...|
|{{pp-protected|sm...|
|{{short descripti...|
+--------------------+
only showing top 20 rows

