<a href="https://colab.research.google.com/github/blue-create/langlens/blob/main/export/elinor_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Purpose:
- exporting csvs to be uploaded as experiment to elinor

### Imports

In [1]:
import pandas as pd
import os
import numpy as np
import json
from ast import literal_eval

### Constants & Methods

In [2]:
exclude_articles=[# Notufe, Beratungen
                        "Bereitschaftsdienst", "Hotline", "Notruf", "Hilfetelefon","behindertenfahrdienst","Polizeiinspektion",
                        "Feuerwehr","rettungsdienst", "Notdienst","Bereitschaftspraxis","Öffnungszeiten","Vergiftungen",
                        "Ärztehaus","Selbsthilfegruppe","Leitstelle","Tel","Aids","Ambulante","ACE",
                        "Club","Interventionsstelle","Frauenberatungsstelle","Rufnummer","Rufnummern", "apotheke", "hilfsangebot","hilfsangebote", "Klinikum"
                        "opferhilfe","Berufsbildungszentrum","opferschutz",
                        # Kampagnen, Akitonen
                        "kampagne", "aktion", "ring","initiative","Frauen helfen Frauen"
                        # Events, Services
                        "Mo Di","mi Do","do fr", "mo do","sa so", "sa mo","di mi","fr sa", "online","Ü50 Singletreff", "Uhr","Treffpunkt",
                        #corona
                        "Dieser Artikel wird laufend aktualisiert"
]

### Connect with Google drive to access data


In [3]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# redirect the working directory of this script to the data folder
%cd /content/drive/MyDrive/Work/Frontline/data/
#%cd /content/drive/MyDrive/data/

/content/drive/.shortcut-targets-by-id/1WfnZsqpG1r110J63sMbfS5TpsDOkveiV/data


In [5]:
from scripts import filtering

Import filtered articles

In [6]:
df=pd.read_csv("filtered/filtered_06_16.csv",converters={"text":literal_eval})

In [7]:
# Explode "text" column
df_exploded= df.explode("text")
# Create "artikel_order" column
df_exploded["artikel_order"] = df_exploded.groupby("artikel_id").cumcount() + 1
df_exploded.shape

(228753, 10)

In [8]:
# remove duplicated paragraphs
df_exploded=df_exploded.drop_duplicates("text")
df_exploded.shape

(154941, 10)

In [9]:
# remove hotlines, etc, if keywords contained in the first 5 words
df_exploded=filtering.filter_data(df_exploded,"text",exclude_articles,False,5)
# remove paragraphs by keywords if container in the first 5 words

exclude_paragraphs=["Stadtteiltreff","Plakataktion","One Billion Rising","Gewalt kommt nicht in die Tüte","opferschutzorganisation","Frauen helfen Frauen","statistik", "kriminalstatistik", "landeskriminalamt"]
df_exploded=filtering.filter_data(df_exploded,"text",exclude_paragraphs,False,5)

(144351, 10)
(143740, 10)


In [10]:
# regex filter: email, links, times, streets,email, weekdays
df_exploded=filtering.regex_filter(df_exploded,"text",)

In [11]:
# very short paragraphs usually are not part of the article
df_exploded.loc[:,"chars"]=df_exploded["text"].apply(len)
df_exploded=df_exploded[df_exploded["chars"]>60]
df_exploded.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_exploded.loc[:,"chars"]=df_exploded["text"].apply(len)


(115556, 11)

### Exclude Paragraphs
- that are annotated already, or
- are currently part of an experiment in elinor

Extracting JSON files

In [37]:
dfs_json={}
for doc in os.listdir("annotated"):
  if doc.endswith(".json"):
    #read json data
    json_data=json.load(open("annotated/"+doc, encoding="utf-8"))
    #convert to dataframe
    data=pd.DataFrame(json_data["documents"],)
    #for now: filter out paragraphs that have not been annotated
    data=data[data["annotations"].apply(len)>0]
    data.loc[:,"file"]=doc
    dfs_json[doc]=data
dfs_json=pd.concat(dfs_json)

In [38]:
dfs_json["artikel_id"]=dfs_json.attributes_flat.apply(lambda x: x["artikel_id"])
dfs_json["artikel_order"]=dfs_json.attributes_flat.apply(lambda x: x["artikel_order"])
dfs_json.artikel_order=dfs_json.artikel_order.astype(float)

In [39]:

dfs_csv = []

# loop through files
for filename in os.listdir("annotated"):
    # if csv file, load and add to dfs
    if filename.endswith(".csv"):
        file_path = os.path.join("annotated", filename)
        # import csv with text as list object
        df = pd.read_csv(file_path, index_col=0, converters={"annotations":literal_eval,})
        df.loc[:,"file"]=filename
        dfs_csv.append(df)
# combine files in df
dfs_csv = pd.concat(dfs_csv, ignore_index=True)
dfs_csv.artikel_order=dfs_csv.artikel_order.astype(float)

In [40]:
dfs_all=pd.concat([dfs_csv,dfs_json])

In [45]:
dfs_all_i=dfs_all[["artikel_id","artikel_order"]].dropna()
dfs_all_i["valid"]=False

In [50]:
df_merged=pd.merge(dfs_all_i,df_exploded,left_on=["artikel_id","artikel_order"],right_on=["artikel_id","artikel_order"], how="right")

(2134, 3)

In [52]:
df_valid=df_merged[df_merged.valid!=False]

### Randomly select one paragraph per article

In [53]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [54]:
def count_sentences(text):
    return len(nltk.sent_tokenize(text))

In [55]:
df_valid['num_sentences'] = df_valid['text'].apply(count_sentences)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_valid['num_sentences'] = df_valid['text'].apply(count_sentences)


In [56]:
# Define a function to randomly select one row from each group
def select_random_row(group):
    if group['num_sentences'].max() > 1:
        return group[group['num_sentences'] > 1].sample(n=1)
    else:
        return group.head(1)

In [57]:
# Apply the function to each group and combine the results
random_rows = df_valid.groupby('artikel_id').apply(select_random_row).reset_index(drop=True)

In [81]:
random_rows=random_rows.drop("valid",axis=1)

### Export batches of 500

### Randomly select one paragraph per article

In [62]:
from datetime import datetime

In [83]:
date=datetime.now().strftime("%Y%m%d")
n_batches=8
size=500

In [84]:
for i in range(1,n_batches+1):
  chunk=random_rows.iloc[((i-1)*size):(i*size),:]
  chunk.to_csv(f"elinor/elinor_{date}_part{i}.csv", index=False, header = True,encoding = 'utf-8')