**Code for testing validity of your predictions .csv file**

Once you have created your data frame (I am calling it df) with predictions, you need to write it out as a csv file for submission.

When you do this, make sure to use the index=False option as in

    df.to_csv(filename, index=False)

where filename is any name you wish to use.
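To see what `index=False` changes, here is a minimal sketch that writes a tiny data frame both ways and compares the header rows. The three rows and the URLID values are made up for illustration, and the output goes to in-memory buffers rather than to disk.

```python
import io
import pandas as pd

# Hypothetical toy predictions; a real submission has 50,000 rows.
df = pd.DataFrame({
    "URLID": ["url_a", "url_b", "url_c"],
    "length": [1200, 850.5, 40],
    "word_present": [1, 0, 1],
    "edited_2023": [0, 0, 1],
})

# Write to in-memory buffers so the example leaves no files behind.
with_index = io.StringIO()
df.to_csv(with_index)                  # default index=True adds an unnamed first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)  # only the four required columns are written

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)
print(header_without_index)
```

The first header starts with an empty field (the unnamed index column), which is what produces a 5-column file; the second contains only the four required names.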

The csv file with your predictions needs to be in a specific format described as follows:

- There should be exactly 4 columns
- There should be a header row with the following entries:
    - URLID
    - length - int or float
    - word_present - 0/1-valued
    - edited_2023 - 0/1-valued
- There should be an additional 50,000 rows, one for each url/page for which you are making predictions
- The URLID column is very important. To evaluate your prediction performance, this file will be merged with the actual responses (only I have those) by URLID, so it is essential that the URLIDs can be matched.
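The following sketch illustrates why matching URLIDs matter. It is only an assumption about what a merge by URLID might look like, not the actual evaluation code; the frames `preds` and `truth` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical predictions and held-out responses; the real responses are not public.
preds = pd.DataFrame({"URLID": ["url_a", "url_b", "url_c"], "word_present": [1, 0, 1]})
truth = pd.DataFrame({"URLID": ["url_a", "url_b", "url_d"], "word_present": [1, 1, 0]})

# A left merge from the responses keeps every true row;
# indicator=True adds a "_merge" column marking rows with no matching prediction.
merged = truth.merge(
    preds, on="URLID", how="left", suffixes=("_true", "_pred"), indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "URLID"].tolist()
print(unmatched)
```

Any URLID that appears in `unmatched` has no prediction attached to it, so a mistyped or altered URLID would leave that page without a prediction in a merge of this kind.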

Before submitting your predictions, you should save your csv to disk and run the following program to read it in and determine whether it has been formatted properly.

You should keep checking your csv file until it is determined to have been formatted properly.

In [1]:
import pandas as pd
import numpy as np

def check_csvfile_format(filename):
    try:
        df=pd.read_csv(filename)
    except Exception:
        print("error: pandas cannot read your csv file")
        return
    if df.shape[1]==5:
        print("your file has 5 columns:")
        print(list(df.columns))
        print("these are the columns you should have included")
        print("URLID,length,word_present,edited_2023")
        print("did you remember to use the index=False option when you used df.to_csv(...)?")
        return
    elif df.shape[1]>5:
        print("your file has too many columns")
        print(list(df.columns))
        print("these are the columns you should have included")
        print("URLID,length,word_present,edited_2023")
        print("did you remember to use the index=False option when you used df.to_csv(...)?")
        return
    elif df.shape[1]<4:
        print("your file has too few columns")
        print(list(df.columns))
        print("these are the columns you should have included")
        print("URLID,length,word_present,edited_2023")     
        return
    if df.columns[0]!="URLID":
        print("incorrect column 0 - it should be URLID")
        return
    if df.columns[1]!="length":
        print("incorrect column 1 - it should be length")
        return
    if df.columns[2]!="word_present":
        print("incorrect column 2 - it should be word_present")
        return
    if df.columns[3]!="edited_2023":
        print("incorrect column 3 - it should be edited_2023")
        return
    if df.shape[0]!=50000:
        print("your file has "+str(df.shape[0])+" rows")
        print("this is not the correct number - should be 50,000")
        return
    n=df.URLID.nunique()
    if n!=50000:
        print("there is something wrong with your URLID column")
        print("it has "+str(n)+" unique values instead of 50,000")
        return
    
    # use .iloc for positional access; df.dtypes[i] is deprecated in recent pandas
    if df.dtypes.iloc[0]!="object":
        print("error: URLID column type should be object")
        return
    if df.dtypes.iloc[1]!="int64" and df.dtypes.iloc[1]!="float64":
        print("error: length column type should be int64 or float64")
        return
    if df.dtypes.iloc[2]!="int64" and df.dtypes.iloc[2]!="float64" and df.dtypes.iloc[2]!="bool":
        print("error: word_present column type should be int64, float64 or bool")
        return
    if df.dtypes.iloc[3]!="int64" and df.dtypes.iloc[3]!="float64" and df.dtypes.iloc[3]!="bool":
        print("error: edited_2023 column type should be int64, float64 or bool")
        return
    
    nmissing=df.length.isna().sum()
    if nmissing>0:
        print("there are missing values in the length column")
        print("you are not allowed to have missing values")
        return
    nmissing=df.word_present.isna().sum()
    if nmissing>0:
        print("there are missing values in the word_present column")
        print("you are not allowed to have missing values")
        return
    
    nmissing=df.edited_2023.isna().sum()
    if nmissing>0:
        print("there are missing values in the edited_2023 column")
        print("you are not allowed to have missing values")
        return
    vc=df.word_present.value_counts()
    if len(vc)!=2:
        print("your word_present column should only contain 2 distinct values")
        print("yours has "+str(len(vc)))
        print("here are your values and counts")
        print(vc)
        return
    vc=df.edited_2023.value_counts()
    if len(vc)!=2:
        print("your edited_2023 column should only contain 2 distinct values")
        print("yours has "+str(len(vc)))
        print("here are your values and counts")
        print(vc)
        return
    # report success only after every check has passed
    print("csv file appears to be correctly formatted")

In [2]:
check_csvfile_format('./final_part2-1.csv')

csv file appears to be correctly formatted
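The checker above only confirms that each 0/1 column has two distinct values. If you also want to confirm that those values are actually 0 and 1, a helper along these lines might be useful. `non_binary_values` is a hypothetical name and is not part of the checker; the sketch assumes the column is numeric or boolean.

```python
import pandas as pd

def non_binary_values(series):
    """Return the sorted distinct values in `series` that are neither 0 nor 1."""
    # .tolist() converts numpy scalars to plain Python values for readable output
    distinct = series.dropna().unique().tolist()
    # True/False and 1.0/0.0 compare equal to 1/0, so bool and float columns pass
    return sorted(v for v in distinct if v not in (0, 1))

# Hypothetical column containing one invalid value
example = pd.Series([0, 1, 1, 2, 0])
print(non_binary_values(example))
```

An empty list means every non-missing value in the column is 0 or 1. After reading your file with `pd.read_csv`, you could call it as `non_binary_values(df.word_present)` and `non_binary_values(df.edited_2023)`.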
