# Plagiarism Detection - NLP Feature Engineering

## Steps:
* Clean and pre-process the data.
* Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
* Select "good" features, by analyzing the correlations between different features.
* Create train/test `.csv` files that hold the relevant features and class labels for train/test data points.

These similarity features come from the plagiarism research done in [this paper](https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c412841_developing-a-corpus-of-plagiarised-short-answers/developing-a-corpus-of-plagiarised-short-answers.pdf).

In [1]:
# import libraries
import pandas as pd
import numpy as np
import os

This plagiarism dataset is made up of multiple text files; the characteristics of each file are summarized in a `.csv` file named `file_information.csv`, which I read in using `pandas`.

In [2]:
csv_file = 'data/file_information.csv'
plagiarism_df = pd.read_csv(csv_file)

# print out the first few rows of data info
plagiarism_df.head()

Unnamed: 0,File,Task,Category
0,g0pA_taska.txt,a,non
1,g0pA_taskb.txt,b,cut
2,g0pA_taskc.txt,c,light
3,g0pA_taskd.txt,d,heavy
4,g0pA_taske.txt,e,non


## Types of Plagiarism

Each text file is associated with one **Task** (task A-E) and one **Category** of plagiarism, which you can see in the above DataFrame.

###  Tasks, A-E

Each text file contains an answer to one short question; these questions are labeled as tasks A-E. For example, Task A asks the question: "What is inheritance in object oriented programming?"

### Categories of plagiarism 

Each text file has an associated plagiarism label/category:

**1. Plagiarized categories: `cut`, `light`, and `heavy`.**
* These categories represent different levels of plagiarized answer texts. `cut` answers copy directly from a source text, `light` answers are based on the source text but include some light rephrasing, and `heavy` answers are based on the source text, but *heavily* rephrased (and will likely be the most challenging kind of plagiarism to detect).
     
**2. Non-plagiarized category: `non`.** 
* `non` indicates that an answer is not plagiarized; the Wikipedia source text is not used to create this answer.
    
**3. Special, source text category: `orig`.**
* This is a specific category for the original, Wikipedia source text. We will use these files only for comparison purposes.

## Pre-Process the Data

### Convert categorical to numerical data

My goal is to create a binary classifier, so I'll create a binary class label that indicates whether an answer text is plagiarized (1) or not (0).

The data frame will have these properties:
* 4 columns: `File`, `Task`, `Category`, `Class`. The `File` and `Task` columns can remain unchanged from the original `.csv` file.
* Convert all `Category` labels to numerical labels according to the following rules (a higher value indicates a higher degree of plagiarism):
    * 0 = `non`
    * 1 = `heavy`
    * 2 = `light`
    * 3 = `cut`
    * -1 = `orig`, this is a special value that indicates an original file.
* For the new `Class` column
    * Any answer text that is not plagiarized (`non`) should have the class label `0`. 
    * Any plagiarized answer texts should have the class label `1`. 
    * And any `orig` texts will have a special label `-1`. 
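
As a cross-check on the rules above, the class label can also be derived from the numerical category, because every plagiarized level is positive. The following is a minimal sketch on a made-up `toy_df` (not the project data); `category_map` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the labels follow the rules listed above.
toy_df = pd.DataFrame({'Category': ['non', 'heavy', 'light', 'cut', 'orig']})

category_map = {'non': 0, 'heavy': 1, 'light': 2, 'cut': 3, 'orig': -1}
toy_df['Category'] = toy_df['Category'].map(category_map)

# np.sign maps 0 -> 0, any positive level -> 1, and -1 -> -1
toy_df['Class'] = np.sign(toy_df['Category'])

print(toy_df)
```

This only works because the chosen encoding puts `non` at 0, `orig` below 0, and all plagiarized levels above 0; a different encoding would need an explicit mapping.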

### Expected output

In [5]:
# Read in a csv file and return a transformed dataframe
def numerical_dataframe(csv_file='data/file_information.csv'):
    plagiarism_df = pd.read_csv(csv_file)
    print(plagiarism_df.head())
    
    # Define a new column that starts as a copy of the Category column
    plagiarism_df['Class'] = plagiarism_df['Category']
    
    # Map each Category label to its numerical value
    # (assigning the result back avoids relying on in-place replacement of a column)
    cate = {'non': 0, 'heavy': 1, 'light': 2, 'cut': 3, 'orig': -1}
    plagiarism_df['Category'] = plagiarism_df['Category'].map(cate)
    
    # Map each label to its class: 0 = not plagiarized, 1 = plagiarized, -1 = original
    class_map = {'non': 0, 'heavy': 1, 'light': 1, 'cut': 1, 'orig': -1}
    plagiarism_df['Class'] = plagiarism_df['Class'].map(class_map)
    
    return plagiarism_df
    
#numerical_dataframe(csv_file)


### Test cells

Below are a couple of test cells. The first is an informal test that prints the result of calling the function. The **second** cell is a more rigorous test; its goal is to ensure that my code is working as expected.


In [6]:
# informal testing, print out the results of a called function
# create new `transformed_df`
transformed_df = numerical_dataframe(csv_file ='data/file_information.csv')

# check work
# check that all categories of plagiarism have a class label = 1
transformed_df.head(10)

             File Task Category
0  g0pA_taska.txt    a      non
1  g0pA_taskb.txt    b      cut
2  g0pA_taskc.txt    c    light
3  g0pA_taskd.txt    d    heavy
4  g0pA_taske.txt    e      non


Unnamed: 0,File,Task,Category,Class
0,g0pA_taska.txt,a,0,0
1,g0pA_taskb.txt,b,3,1
2,g0pA_taskc.txt,c,2,1
3,g0pA_taskd.txt,d,1,1
4,g0pA_taske.txt,e,0,0
5,g0pB_taska.txt,a,0,0
6,g0pB_taskb.txt,b,0,0
7,g0pB_taskc.txt,c,3,1
8,g0pB_taskd.txt,d,2,1
9,g0pB_taske.txt,e,1,1


In [8]:
# test cell that creates `transformed_df`, if tests are passed

# importing tests
import problem_unittests as tests

# test numerical_dataframe function
tests.test_numerical_df(numerical_dataframe)

# if above test is passed, create NEW `transformed_df`
transformed_df = numerical_dataframe(csv_file ='data/file_information.csv')

# check work
print('\nExample data: ')
transformed_df.head()

             File Task Category
0  g0pB_taske.txt    e    heavy
1  g0pC_taska.txt    a    heavy
2  g0pC_taskb.txt    b      non
3  g0pC_taskc.txt    c      non
4  g0pC_taskd.txt    d      cut
Tests Passed!
             File Task Category
0  g0pA_taska.txt    a      non
1  g0pA_taskb.txt    b      cut
2  g0pA_taskc.txt    c    light
3  g0pA_taskd.txt    d    heavy
4  g0pA_taske.txt    e      non

Example data: 


Unnamed: 0,File,Task,Category,Class
0,g0pA_taska.txt,a,0,0
1,g0pA_taskb.txt,b,3,1
2,g0pA_taskc.txt,c,2,1
3,g0pA_taskd.txt,d,1,1
4,g0pA_taske.txt,e,0,0


## Text Processing & Splitting Data

In [9]:
import helpers 

# create a text column 
text_df = helpers.create_text_column(transformed_df)
text_df.head()

Unnamed: 0,File,Task,Category,Class,Text
0,g0pA_taska.txt,a,0,0,inheritance is a basic concept of object orien...
1,g0pA_taskb.txt,b,3,1,pagerank is a link analysis algorithm used by ...
2,g0pA_taskc.txt,c,2,1,the vector space model also called term vector...
3,g0pA_taskd.txt,d,1,1,bayes theorem was names after rev thomas bayes...
4,g0pA_taske.txt,e,0,0,dynamic programming is an algorithm design tec...


In [10]:
# after running the cell above
# check out the processed text for a single file, by row index
row_idx = 0

sample_text = text_df.iloc[row_idx]['Text']

print('Sample processed text:\n\n', sample_text)

Sample processed text:

 inheritance is a basic concept of object oriented programming where the basic idea is to create new classes that add extra detail to existing classes this is done by allowing the new classes to reuse the methods and variables of the existing classes and new methods and classes are added to specialise the new class inheritance models the is kind of relationship between entities or objects  for example postgraduates and undergraduates are both kinds of student this kind of relationship can be visualised as a tree structure where student would be the more general root node and both postgraduate and undergraduate would be more specialised extensions of the student node or the child nodes  in this relationship student would be known as the superclass or parent class whereas  postgraduate would be known as the subclass or child class because the postgraduate class extends the student class  inheritance can occur on several layers where if visualised would display a l

## Split data into training and test sets

The next cell will add a `Datatype` column to a given DataFrame to indicate if the record is: 
* `train` - Training data, for model training.
* `test` - Testing data, for model evaluation.
* `orig` - The task's original source text from Wikipedia.

### Stratified sampling

The given code uses a helper function which you can view in the `helpers.py` file in the main project directory. This implements [stratified random sampling](https://en.wikipedia.org/wiki/Stratified_sampling) to randomly split data by task & plagiarism amount. Stratified sampling ensures that we get training and test data that is fairly evenly distributed across task & plagiarism combinations. Approximately 26% of the data is held out for testing and 74% of the data is used for training.

The function **train_test_dataframe** takes in a DataFrame that it assumes has `Task` and `Category` columns, and returns a modified frame that indicates which `Datatype` (train, test, or orig) a file falls into. The sampling changes slightly depending on the *random_seed* that is passed in. Because the sample size is small, stratified random sampling gives more stable results for a binary plagiarism classifier; stability here means smaller *variance* in the classifier's accuracy across random seeds.
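
To make the idea of stratification concrete, here is a minimal sketch on made-up data. It is not the code in `helpers.py`, which is not shown here; it assumes pandas 1.1 or later for `groupby(...).sample`, and `toy_df` and `test_part` are names introduced for illustration.

```python
import pandas as pd

# Toy data for illustration only: 2 tasks x 2 categories, 4 files per combination.
rows = [(task, cat) for task in ['a', 'b'] for cat in [0, 3] for _ in range(4)]
toy_df = pd.DataFrame(rows, columns=['Task', 'Category'])

# Draw 25% of the rows from every (Task, Category) group for the test set.
test_part = toy_df.groupby(['Task', 'Category']).sample(frac=0.25, random_state=1)

# Label rows by whether they were drawn for the test set.
toy_df['Datatype'] = 'train'
toy_df.loc[test_part.index, 'Datatype'] = 'test'

print(toy_df.groupby(['Task', 'Category', 'Datatype']).size())
```

Sampling within each group, rather than from the whole frame, is what guarantees that every task and plagiarism level shows up in both splits.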

In [11]:
random_seed = 1 # can change; set for reproducibility

import helpers

# create new df with Datatype (train, test, orig) column
# pass in `text_df` from above to create a complete dataframe, with all the information you need
complete_df = helpers.train_test_dataframe(text_df, random_seed=random_seed)

# check results
complete_df.head(10)

Unnamed: 0,File,Task,Category,Class,Text,Datatype
0,g0pA_taska.txt,a,0,0,inheritance is a basic concept of object orien...,train
1,g0pA_taskb.txt,b,3,1,pagerank is a link analysis algorithm used by ...,test
2,g0pA_taskc.txt,c,2,1,the vector space model also called term vector...,train
3,g0pA_taskd.txt,d,1,1,bayes theorem was names after rev thomas bayes...,train
4,g0pA_taske.txt,e,0,0,dynamic programming is an algorithm design tec...,train
5,g0pB_taska.txt,a,0,0,inheritance is a basic concept in object orien...,train
6,g0pB_taskb.txt,b,0,0,pagerank pr refers to both the concept and the...,train
7,g0pB_taskc.txt,c,3,1,vector space model is an algebraic model for r...,test
8,g0pB_taskd.txt,d,2,1,bayes theorem relates the conditional and marg...,train
9,g0pB_taske.txt,e,1,1,dynamic programming is a method for solving ma...,test


# Determining Plagiarism


# Similarity Features 

One way to detect plagiarism is to compute **similarity features** that measure how similar a given answer text is to the original Wikipedia source text (for a specific task, a-e). The similarity features I will use are informed by [this paper on plagiarism detection](https://s3.amazonaws.com/video.udacity-data.com/topher/2019/January/5c412841_developing-a-corpus-of-plagiarised-short-answers/developing-a-corpus-of-plagiarised-short-answers.pdf). 
> In this paper, researchers created features called **containment** and **longest common subsequence**.
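
Containment, as computed in this notebook, is the number of word n-grams shared by the answer and the source (counting each n-gram as many times as it appears in both), divided by the total number of n-grams in the answer. The sketch below illustrates that definition in plain Python; `ngram_counts` and `containment` are hypothetical helper names, and the whitespace tokenization is an assumption that differs from `CountVectorizer`, whose default token pattern ignores single-character tokens.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams in a whitespace-tokenized text."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def containment(answer_text, source_text, n):
    """Shared n-gram count divided by the answer's n-gram count."""
    answer = ngram_counts(answer_text, n)
    source = ngram_counts(source_text, n)
    # Counter & Counter keeps the minimum count of each shared n-gram
    shared = sum((answer & source).values())
    return shared / sum(answer.values())

A = "this is an answer text"
S = "this is a source text"
print(containment(A, S, 1))  # 3 shared unigrams out of 5 -> 0.6
print(containment(A, S, 2))  # 1 shared bigram out of 4 -> 0.25
```

Note how the value drops as n grows: longer n-grams only match when whole phrases are copied, which is why higher-n containment separates cut-and-paste answers from paraphrased ones.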

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Calculate the ngram containment for one answer file/source file pair in a df
def calculate_containment(df, n, answer_filename):
    # Select the row for the given answer file
    answer_row = df.loc[df['File'] == answer_filename]
    
    # Get the task and the processed text of the answer
    task = answer_row['Task'].values[0]
    answer_text = answer_row['Text'].values[0]
    
    # The source text is the 'orig' row with the same task
    source_text = df.loc[
        (df['Datatype'] == 'orig') & (df['Task'] == task)
    ]['Text'].values[0]
    
    # instantiate an ngram counter
    counts = CountVectorizer(analyzer='word', ngram_range=(n, n))
    
    # create array of n-gram counts: row 0 = answer text, row 1 = source text
    ngram_array = counts.fit_transform([answer_text, source_text]).toarray()
    
    # intersection: for each n-gram, the smaller of the two counts
    intersection = np.sum(np.amin(ngram_array, axis=0))
    
    # normalize by the total number of n-grams in the answer text
    answer_cnt = np.sum(ngram_array[0])
    containment_val = intersection / answer_cnt
    
    return containment_val

filename = complete_df.loc[0, 'File']
calculate_containment(complete_df, 1, filename)

0.39814814814814814

### Test cells
The cell below iterates through the first few files, and calculates the original category _and_ containment values for a specified n and file.

In [15]:
# select a value for n
n = 3

# indices for first few files
test_indices = range(5)

# iterate through files and calculate containment
category_vals = []
containment_vals = []
for i in test_indices:
    # get level of plagiarism for a given file index
    category_vals.append(complete_df.loc[i, 'Category'])
    # calculate containment for given file and n
    filename = complete_df.loc[i, 'File']
    c = calculate_containment(complete_df, n, filename)
    containment_vals.append(c)

# print out result, does it make sense?
print('Original category values: \n', category_vals)
print()
print(str(n)+'-gram containment values: \n', containment_vals)

Original category values: 
 [0, 3, 2, 1, 0]

3-gram containment values: 
 [0.009345794392523364, 0.9641025641025641, 0.6136363636363636, 0.15675675675675677, 0.031746031746031744]


In [16]:
# test containment calculation
# params: complete_df from before, and containment function
tests.test_containment(complete_df, calculate_containment)

Tests Passed!


## Longest Common Subsequence
Containment is a good way to find overlap in word usage between two documents; it may help identify cases of cut-and-paste as well as paraphrased levels of plagiarism. Since plagiarism detection is a fairly complex task with varying levels, it's often useful to include other measures of similarity. The paper also discusses a feature called **longest common subsequence**.

> The longest common subsequence is the longest sequence of words (or letters) that appear *in the same order* in both the Wikipedia Source Text (S) and the Student Answer Text (A); the matching words do not need to be adjacent. This value is also normalized by dividing by the total number of words (or letters) in the Student Answer Text. 
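
The word-level LCS length can be written as a standard dynamic-programming recurrence. The notation is introduced here for illustration and is not taken from the paper: `a_i` is the i-th word of the answer, `s_j` is the j-th word of the source, and `L(i, j)` is the LCS length of the first i answer words and the first j source words.

```latex
L(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
L(i-1,\, j-1) + 1 & \text{if } a_i = s_j \\
\max\bigl(L(i-1,\, j),\; L(i,\, j-1)\bigr) & \text{otherwise}
\end{cases}
\qquad
\mathrm{lcs\_norm} = \frac{L(|A|,\, |S|)}{|A|}
```

Filling a table of `L` values row by row takes time proportional to the product of the two text lengths, which is manageable for short answers like these.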

In [17]:
# Compute the normalized LCS given an answer text and a source text
def lcs_norm_word(answer_text, source_text):
    # split both texts into lists of words
    answer_words = answer_text.split()
    source_words = source_text.split()
    
    n_rows = len(answer_words)
    n_cols = len(source_words)
    
    # 2D matrix for the LCS calculation;
    # matrix[i][j] = LCS length of the first i answer words and first j source words
    matrix = np.zeros((n_rows + 1, n_cols + 1), dtype=int)
    
    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            if answer_words[i-1] == source_words[j-1]:
                matrix[i][j] = matrix[i-1][j-1] + 1
            else:
                matrix[i][j] = max(matrix[i-1][j], matrix[i][j-1])
    
    # bottom-right entry holds the LCS length; normalize by the answer length
    lcs_length = matrix[n_rows][n_cols]
    
    return lcs_length / n_rows

### Test cells

In [18]:
A = "i think pagerank is a link analysis algorithm used by google that uses a system of weights attached to each element of a hyperlinked set of documents"
S = "pagerank is a link analysis algorithm used by the google internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents"

# calculate LCS
lcs = lcs_norm_word(A, S)
print('LCS = ', lcs)


# expected value test
assert lcs==20/27., "Incorrect LCS value, expected about 0.7408, got "+str(lcs)

print('Test passed!')

LCS =  0.7407407407407407
Test passed!


This next cell runs a more rigorous test.

In [19]:
# test lcs implementation
# params: complete_df from before, and lcs_norm_word function
tests.test_lcs(complete_df, lcs_norm_word)

Tests Passed!


In [20]:
test_indices = range(5) # look at first few files

category_vals = []
lcs_norm_vals = []
# iterate through first few docs and calculate LCS
for i in test_indices:
    category_vals.append(complete_df.loc[i, 'Category'])
    # get texts to compare
    answer_text = complete_df.loc[i, 'Text'] 
    task = complete_df.loc[i, 'Task']
    # we know that source texts have Class = -1
    orig_rows = complete_df[(complete_df['Class'] == -1)]
    orig_row = orig_rows[(orig_rows['Task'] == task)]
    source_text = orig_row['Text'].values[0]
    
    # calculate lcs
    lcs_val = lcs_norm_word(answer_text, source_text)
    lcs_norm_vals.append(lcs_val)

# print out result, does it make sense?
print('Original category values: \n', category_vals)
print()
print('Normalized LCS values: \n', lcs_norm_vals)

Original category values: 
 [0, 3, 2, 1, 0]

Normalized LCS values: 
 [0.1917808219178082, 0.8207547169811321, 0.8464912280701754, 0.3160621761658031, 0.24257425742574257]


# Create All Features

In [21]:
# Function returns a list of containment features, calculated for a given n 
# Should return a list of length 100 for all files in a complete_df
def create_containment_features(df, n, column_name=None):
    
    containment_values = []
    
    if column_name is None:
        column_name = 'c_'+str(n) # c_1, c_2, .. c_n
    
    # iterates through dataframe rows
    for i in df.index:
        file = df.loc[i, 'File']
        # Computes features using calculate_containment function
        if df.loc[i,'Category'] > -1:
            c = calculate_containment(df, n, file)
            containment_values.append(c)
        # Sets value to -1 for original tasks 
        else:
            containment_values.append(-1)
    
    print(str(n)+'-gram containment features created!')
    return containment_values


### Creating LCS features

In [22]:
# Function creates the lcs feature and returns it as a list of values
def create_lcs_features(df, column_name='lcs_word'):
    
    lcs_values = []
    
    # iterate through files in dataframe
    for i in df.index:
        # Computes LCS_norm words feature using function above for answer tasks
        if df.loc[i,'Category'] > -1:
            # get texts to compare
            answer_text = df.loc[i, 'Text'] 
            task = df.loc[i, 'Task']
            # we know that source texts have Class = -1
            orig_rows = df[(df['Class'] == -1)]
            orig_row = orig_rows[(orig_rows['Task'] == task)]
            source_text = orig_row['Text'].values[0]

            # calculate lcs
            lcs = lcs_norm_word(answer_text, source_text)
            lcs_values.append(lcs)
        # Sets to -1 for original tasks 
        else:
            lcs_values.append(-1)

    print('LCS features created!')
    return lcs_values
    

## Create a features DataFrame by selecting an `ngram_range`

The paper suggests calculating the following features: containment *1-gram to 5-gram* and *longest common subsequence*. 
> In this exercise, you can choose to create even more features, for example from *1-gram to 7-gram* containment features and *longest common subsequence*. 

In [23]:
# Define an ngram range; range(1,7) yields n = 1 through 6
ngram_range = range(1,7)

# The following code may take a minute to run, depending on your ngram_range
features_list = []

# Create features in a features_df
all_features = np.zeros((len(ngram_range)+1, len(complete_df)))

# Calculate features for containment for ngrams in range
i=0
for n in ngram_range:
    column_name = 'c_'+str(n)
    features_list.append(column_name)
    # create containment features
    all_features[i]=np.squeeze(create_containment_features(complete_df, n))
    i+=1

# Calculate features for LCS_Norm Words 
features_list.append('lcs_word')
all_features[i]= np.squeeze(create_lcs_features(complete_df))

# create a features dataframe
features_df = pd.DataFrame(np.transpose(all_features), columns=features_list)

# Print all features/columns
print()
print('Features: ', features_list)
print()

1-gram containment features created!
2-gram containment features created!
3-gram containment features created!
4-gram containment features created!
5-gram containment features created!
6-gram containment features created!
LCS features created!

Features:  ['c_1', 'c_2', 'c_3', 'c_4', 'c_5', 'c_6', 'lcs_word']



In [24]:
# print some results 
features_df.head(10)

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,lcs_word
0,0.398148,0.07907,0.009346,0.0,0.0,0.0,0.191781
1,1.0,0.984694,0.964103,0.943299,0.92228,0.901042,0.820755
2,0.869369,0.719457,0.613636,0.515982,0.449541,0.382488,0.846491
3,0.593583,0.268817,0.156757,0.108696,0.081967,0.06044,0.316062
4,0.544503,0.115789,0.031746,0.005319,0.0,0.0,0.242574
5,0.329502,0.053846,0.007722,0.003876,0.0,0.0,0.161172
6,0.590308,0.150442,0.035556,0.004464,0.0,0.0,0.301653
7,0.765306,0.709898,0.664384,0.62543,0.589655,0.553633,0.621711
8,0.759777,0.505618,0.39548,0.306818,0.245714,0.195402,0.484305
9,0.884444,0.526786,0.340807,0.247748,0.180995,0.15,0.597458


## Correlated Features
Here I choose my features based on which pairings have the lowest correlation. Because I take absolute values, these correlation values range between 0 and 1, from low to high correlation, and are displayed in a [correlation matrix](https://www.displayr.com/what-is-a-correlation-matrix/) below.

In [25]:
# Create correlation matrix for just Features to determine different models to test
corr_matrix = features_df.corr().abs().round(2)

# display shows all of a dataframe
display(corr_matrix)

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,lcs_word
c_1,1.0,0.94,0.9,0.89,0.88,0.87,0.97
c_2,0.94,1.0,0.99,0.98,0.97,0.96,0.98
c_3,0.9,0.99,1.0,1.0,0.99,0.98,0.97
c_4,0.89,0.98,1.0,1.0,1.0,0.99,0.95
c_5,0.88,0.97,0.99,1.0,1.0,1.0,0.95
c_6,0.87,0.96,0.98,0.99,1.0,1.0,0.94
lcs_word,0.97,0.98,0.97,0.95,0.95,0.94,1.0
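
Reading the least-correlated pairs off a matrix by eye gets harder as the number of features grows. The sketch below ranks feature pairs by correlation; `rank_feature_pairs` is a hypothetical helper, and `toy_corr` holds a hand-copied subset of the values shown above rather than the computed `corr_matrix`.

```python
from itertools import combinations
import pandas as pd

# A subset of the correlation values shown above, entered by hand for illustration.
cols = ['c_1', 'c_3', 'c_6', 'lcs_word']
toy_corr = pd.DataFrame(
    [[1.00, 0.90, 0.87, 0.97],
     [0.90, 1.00, 0.98, 0.97],
     [0.87, 0.98, 1.00, 0.94],
     [0.97, 0.97, 0.94, 1.00]],
    index=cols, columns=cols)

def rank_feature_pairs(corr):
    """Return (feature_a, feature_b, correlation) tuples, least correlated first."""
    pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]
    return sorted(pairs, key=lambda pair: pair[2])

for a, b, value in rank_feature_pairs(toy_corr):
    print(a, b, value)
```

In this subset the least-correlated pair is `c_1` and `c_6`, which agrees with the general pattern in the matrix: containment features with very different n are the least redundant.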


## Create selected train/test data

In [26]:
# Takes in dataframes and a list of selected features (column names) 
# and returns (train_x, train_y), (test_x, test_y)
def train_test_data(complete_df, features_df, selected_features):
    '''Gets selected training and test features from given dataframes, and 
       returns tuples for training and test features and their corresponding class labels.
       :param complete_df: A dataframe with all of our processed text data, datatypes, and labels
       :param features_df: A dataframe of all computed, similarity features
       :param selected_features: An array of selected features that correspond to certain columns in `features_df`
       :return: training and test features and labels: (train_x, train_y), (test_x, test_y)'''
    
    # join the selected feature columns onto the complete dataframe
    df = pd.concat([complete_df, features_df[selected_features]], axis=1)
    
    # a boolean mask selects rows; df['train'] would instead look for a column named 'train'
    train_df = df[df.Datatype == 'train']
    test_df = df[df.Datatype == 'test']
    
    # get the training features
    train_x = train_df[selected_features].values
    # And training class labels (0 or 1)
    train_y = train_df['Class'].values
    
    # get the test features and labels
    test_x =  test_df[selected_features].values
    test_y =  test_df['Class'].values
    
    return (train_x, train_y), (test_x, test_y)
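
Row filtering in `train_test_data` relies on a boolean mask, which behaves differently from indexing with a plain string. The following sketch shows the difference on a made-up `toy_df` (not the project data).

```python
import pandas as pd

# Toy frame for illustration only.
toy_df = pd.DataFrame({
    'Datatype': ['train', 'test', 'train', 'orig'],
    'c_1': [0.40, 1.00, 0.87, -1.0],
})

# Boolean mask: keeps the rows where the condition is True.
train_rows = toy_df[toy_df['Datatype'] == 'train']
print(train_rows)

# A plain string is treated as a column label, and no column is named 'train'.
try:
    toy_df['train']
except KeyError as err:
    print('KeyError:', err)
```

The same reasoning explains why `train_df[selected_features]` works: a list of strings selects columns by name, while a boolean Series selects rows.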

### Test cells

Below, test out your implementation and create the final train/test data.

In [27]:
test_selection = list(features_df)[:2] # first couple columns as a test
# test that the correct train/test data is created
(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, test_selection)

# params: generated train/test data
tests.test_data_split(train_x, train_y, test_x, test_y)

Tests Passed!


## Select "good" features

In [28]:
# Select your list of features, this should be column names from features_df
# ex. ['c_1', 'lcs_word']
selected_features = ['c_1', 'c_5', 'lcs_word']

(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, selected_features)

# check that division of samples seems correct
# these should add up to 95 (100 - 5 original files)
print('Training size: ', len(train_x))
print('Test size: ', len(test_x))
print()
print('Training df sample: \n', train_x[:10])

Training size:  70
Test size:  25

Training df sample: 
 [[0.39814815 0.         0.19178082]
 [0.86936937 0.44954128 0.84649123]
 [0.59358289 0.08196721 0.31606218]
 [0.54450262 0.         0.24257426]
 [0.32950192 0.         0.16117216]
 [0.59030837 0.         0.30165289]
 [0.75977654 0.24571429 0.48430493]
 [0.51612903 0.         0.27083333]
 [0.44086022 0.         0.22395833]
 [0.97945205 0.78873239 0.9       ]]



## Create csv files

In [31]:
def make_csv(x, y, filename, data_dir):
  
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    df = pd.concat([pd.DataFrame(y), pd.DataFrame(x)], axis=1)
    print("concaded df is")
    print(df.head())
    
    print("Now Saving")
    
    df.to_csv(os.path.join(data_dir, filename), header=False, index=False)
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

### Test cells

In [32]:
fake_x = [ [0.39814815, 0.0001, 0.19178082], 
           [0.86936937, 0.44954128, 0.84649123], 
           [0.44086022, 0., 0.22395833] ]

fake_y = [0, 1, 1]

make_csv(fake_x, fake_y, filename='to_delete.csv', data_dir='test_csv')

# read in and test dimensions
fake_df = pd.read_csv('test_csv/to_delete.csv', header=None)

# check shape
assert fake_df.shape==(3, 4), \
      'The file should have as many rows as data_points and as many columns as features+1 (for the labels).'
# check that first column = labels
assert np.all(fake_df.iloc[:,0].values==fake_y), 'First column is not equal to the labels, fake_y.'
print('Tests passed!')

concaded df is
   0         0         1         2
0  0  0.398148  0.000100  0.191781
1  1  0.869369  0.449541  0.846491
2  1  0.440860  0.000000  0.223958
Now Saving
Path created: test_csv/to_delete.csv
Tests passed!


In [33]:
# delete the test csv file, generated above
! rm -rf test_csv

## Uploading this data to S3

In [34]:
# can change directory, if you want
data_dir = 'plagiarism_data'

make_csv(train_x, train_y, filename='train.csv', data_dir=data_dir)
make_csv(test_x, test_y, filename='test.csv', data_dir=data_dir)

concaded df is
   0         0         1         2
0  0  0.398148  0.000000  0.191781
1  1  0.869369  0.449541  0.846491
2  1  0.593583  0.081967  0.316062
3  0  0.544503  0.000000  0.242574
4  0  0.329502  0.000000  0.161172
Now Saving
Path created: plagiarism_data/train.csv
concaded df is
   0         0         1         2
0  1  1.000000  0.922280  0.820755
1  1  0.765306  0.589655  0.621711
2  1  0.884444  0.180995  0.597458
3  1  0.619048  0.043243  0.427835
4  1  0.920000  0.394366  0.775000
Now Saving
Path created: plagiarism_data/test.csv
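
Because the files are written without a header or index, with the label in the first column, whatever reads them later has to split the columns by position. The sketch below round-trips made-up data through a temporary file in the same layout; it assumes a plain pandas reader, and how the actual training step loads the data is not shown here.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Toy data for illustration only, written in the same layout as make_csv:
# no header, no index, label in the first column, features after it.
toy_x = np.array([[0.40, 0.00, 0.19], [0.87, 0.45, 0.85]])
toy_y = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.csv')
    pd.concat([pd.DataFrame(toy_y), pd.DataFrame(toy_x)], axis=1).to_csv(
        path, header=False, index=False)

    # Read the file back and split labels from features by position
    loaded = pd.read_csv(path, header=None)
    labels = loaded.iloc[:, 0].values
    features = loaded.iloc[:, 1:].values

print(labels.shape, features.shape)
```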


## Up Next: Training a Model