<a href="https://colab.research.google.com/github/blue-create/langlens/blob/main/scripts/elinor_export.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose

This notebook documents the steps we took to sample the data and create the annotation dataset.

## Connect to Google Drive to access the data

To access the data, first add a shortcut to the data folder in your own Google Drive. If you have been granted editing rights, you can change the contents of the folder, i.e. add, move, and delete data, create and rename folders, etc.

In [1]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# redirect the working directory of this script to the data folder
%cd /content/drive/MyDrive/data/

/content/drive/MyDrive/data


## Load data

### Method 1: load all CSV files in a folder

In [3]:
import tqdm
import os
import pandas as pd

folder_path = "filtered2"

In [None]:
dfs = []

# loop through files 
for filename in os.listdir(folder_path):
    # if csv file, load and add to dfs  
    if filename.endswith(".csv"):
        file_path = os.path.join(folder_path, filename)
        df = pd.read_csv(file_path)
        dfs.append(df)

# combine all files into a single DataFrame
df = pd.concat(dfs, ignore_index=True)
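
The loop above reads the files in whatever order `os.listdir` returns them, which is not guaranteed to be stable. A minimal sketch of an order-stable variant follows; the helper name `load_csv_folder` and the extra `source_file` column are illustrative assumptions and not part of the original notebook.

```python
import glob
import os

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` into one DataFrame (hypothetical helper)."""
    # sorted() makes the row order reproducible across runs and machines
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"no .csv files found in {folder!r}")
    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        # assumption: keeping the file name helps trace rows back to their source
        frame["source_file"] = os.path.basename(path)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

With the folder used above, this could be called as `df = load_csv_folder(folder_path)`.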

### Method 2: load a single CSV file

In [10]:
import tqdm
import os
import pandas as pd

# loaded directly as df_subset, the name used in the formatting steps below
df_subset = pd.read_csv('sample.csv')

## Inspect data


In [7]:
print(df.shape)
df.info()

(50, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  50 non-null     int64 
 1   artikel_id  50 non-null     object
 2   name        50 non-null     object
 3   jahrgang    50 non-null     int64 
 4   datum       50 non-null     int64 
 5   ressort     48 non-null     object
 6   titel       50 non-null     object
 7   untertitel  22 non-null     object
 8   text        50 non-null     object
dtypes: int64(3), object(6)
memory usage: 3.6+ KB


In [8]:
# remove first column 
df1 = df.drop("Unnamed: 0", axis=1)

## Create a random subset of the data 

In [None]:
# size of subset we want 
number = 100

In [None]:
# draw exactly `number` rows; the fixed seed makes the sample reproducible
df_subset = df1.sample(n=number, random_state=42)
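
The sketch below illustrates two properties of the sampling step on toy data: a fixed `random_state` makes the draw repeatable, and capping the requested size at the number of available rows avoids an error when the table is smaller than the target. The names `toy`, `wanted`, and `n_rows` are illustrative and not part of the notebook.

```python
import pandas as pd

# toy stand-in for df1; the real table is much larger
toy = pd.DataFrame({"artikel_id": [f"a{i}" for i in range(10)]})

wanted = 4
# cap the sample size so sample() never asks for more rows than exist
n_rows = min(wanted, len(toy))

sample_a = toy.sample(n=n_rows, random_state=42)
sample_b = toy.sample(n=n_rows, random_state=42)

# the same seed selects the same rows, so the subset can be regenerated later
print(sample_a.index.equals(sample_b.index))
```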

In [None]:
df_subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 141433 to 1157830
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   artikel_id  100 non-null    object 
 1   name        100 non-null    object 
 2   jahrgang    100 non-null    object 
 3   datum       100 non-null    float64
 4   ressort     89 non-null     object 
 5   titel       92 non-null     object 
 6   untertitel  37 non-null     object 
 7   text        100 non-null    object 
dtypes: float64(1), object(7)
memory usage: 7.0+ KB


## Adjust format for export

### String fixes 

In [11]:
# remove the enclosing square brackets
df_subset["text"] = df_subset["text"].apply(lambda x: x[1:-1] if (isinstance(x, str) and x.startswith("[") and x.endswith("]")) else x)
# remove backslashes
df_subset["text"] = df_subset["text"].str.replace("\\", "", regex=False)


In [12]:
# function to split strings to list
def split_to_list(s):
    return s.split("', '")

In [13]:
# apply function to whole column
df_subset['text'] = df_subset['text'].apply(split_to_list)

In [14]:
# append a newline to each paragraph to separate paragraphs
df_subset['text'] = df_subset['text'].apply(lambda x: [paragraph + '\n' for paragraph in x])
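
The string fixes above suggest that each `text` value looks like a printed Python list of paragraphs. This is an assumption inferred from the bracket and `', '` handling, not something the data confirms. If it holds, splitting on `', '` leaves the outer quote characters on the first and last paragraph. As a possible alternative, `ast.literal_eval` can parse such a string directly; the helper name `parse_paragraphs` below is hypothetical.

```python
import ast


def parse_paragraphs(raw):
    """Turn a stringified list such as "['a', 'b']" into a list of paragraphs."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # not a valid Python literal: keep the raw value as a single paragraph
        return [raw]
    if isinstance(parsed, (list, tuple)):
        return [str(p) for p in parsed]
    return [str(parsed)]


example = "['First paragraph.', 'Second paragraph.']"
paragraphs = parse_paragraphs(example)
print(paragraphs)
```

This sketch would be applied to the raw column before any brackets or backslashes are stripped, since it relies on the original quoting being intact.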

### Split by paragraphs

In [21]:
# Explode "text" column
df_elinor = df_subset.explode("text")

# Create "artikel_order" column
df_elinor["artikel_order"] = df_elinor.groupby("artikel_id").cumcount() + 1
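
To make the effect of the two steps above concrete, here is a small sketch on toy data (the values and the name `toy_long` are made up for illustration): `explode` produces one row per paragraph, and `cumcount` numbers the paragraphs within each article.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "artikel_id": ["a1", "a2"],
        "text": [["p1\n", "p2\n", "p3\n"], ["q1\n"]],
    }
)

# one row per paragraph; the other columns are repeated for each row
toy_long = toy.explode("text")
# 1-based position of each paragraph within its article
toy_long["artikel_order"] = toy_long.groupby("artikel_id").cumcount() + 1

print(toy_long[["artikel_id", "artikel_order"]].values.tolist())
# → [['a1', 1], ['a1', 2], ['a1', 3], ['a2', 1]]
```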


## Export as csv

In [23]:
output_path = "elinor"

In [24]:
# the file is tab-separated despite its .csv extension
df_elinor.to_csv(os.path.join(output_path, "annotation_test2.csv"), index=False, sep="\t", header=True)
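
Because the export is tab-separated and the paragraphs end in a newline, it can be worth checking that the file reads back unchanged. The following is a minimal sketch on toy data, using an in-memory buffer instead of a real file; the names `toy`, `buffer`, and `restored` are illustrative.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {
        "artikel_id": ["a1", "a1"],
        "text": ["first\n", "second\n"],
        "artikel_order": [1, 2],
    }
)

# write with the same options as the export step
buffer = io.StringIO()
toy.to_csv(buffer, index=False, sep="\t", header=True)
buffer.seek(0)

# the separator has to be passed again when reading the file back
restored = pd.read_csv(buffer, sep="\t")
print(restored.shape)
```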