# Preprocessing

In this notebook we practise some preprocessing; in particular, we try to extract the sentences that contain a given search pattern.

In [69]:
import pandas as pd
import re
import nltk
from glob import glob

## Reading in a whole lot of text files

1. We create two empty lists, one for the articles and one for the file names.
2. We loop over all file names with a for-loop.
3. For each file name:
- we open the file (this does not *read* it yet, it only sets up a connection),
- we read the contents with `.read()` and append them to the list of articles,
- we also append the name of the file to a second list.

We now have two lists of equal length, one with file names and one with articles. For convenience, we turn them into a dataframe with two columns.


In [109]:
# We need a list of all the files we want to read.
# First we call glob without assigning the result to an object, just to check that we are doing it right
glob("artikelen/*.txt")

['artikelen/art2.txt', 'artikelen/art3.txt', 'artikelen/art1.txt']

In [80]:

artikelen = []
bestandsnamen = []
for fn in glob("artikelen/*.txt"):
    with open(fn) as f:
        artikelen.append(f.read())
        bestandsnamen.append(fn)
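The listing above came back as art2, art3, art1: `glob` does not promise any particular order. A minimal sketch of a variant that sorts the file names and states the encoding explicitly is shown below. The temporary directory, the file contents, and the choice of UTF-8 are assumptions made only to keep the example self-contained.

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create three small stand-in files (hypothetical contents).
    for name, body in [("art2.txt", "two"), ("art1.txt", "one"), ("art3.txt", "three")]:
        Path(tmp, name).write_text(body, encoding="utf-8")

    # sorted() makes the order reproducible across machines and file systems.
    bestandsnamen = sorted(glob(f"{tmp}/*.txt"))

    artikelen = []
    for fn in bestandsnamen:
        # Assumption: the files are UTF-8; adjust if the real data differs.
        with open(fn, encoding="utf-8") as f:
            artikelen.append(f.read())

# Keep only the base names for display.
namen = [Path(fn).name for fn in bestandsnamen]
print(namen, artikelen)
```

Because both lists are filled in the same order, row *i* of the resulting dataframe still pairs the right text with the right file name.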

In [84]:
df = pd.DataFrame({"text":artikelen, "naam":bestandsnamen})

In [85]:
df

Unnamed: 0,text,naam
0,rjhwrk\nwrlhk. ekleh .. Deze zin bevat Immuno...,artikelen/art2.txt
1,ekhpworeh\nwhkw\n4jklrejk\nrejkl\nR5J\n,artikelen/art3.txt
2,ej gtiweohwhkl. Een zin met Immunotherapie. En...,artikelen/art1.txt


# Writing a function to extract sentences
We write a function and then apply it to the dataframe.

In [90]:
def extract_zinnen(x):
    pattern = r"[Ii]mmun[eo].?therap"   # TODO: check whether the regex is correct
    # collapse runs of whitespace (including newlines) into single spaces:
    x = " ".join(x.split())
    # keep only the sentences that contain the pattern
    zinnen = [z for z in nltk.sent_tokenize(x, language='english') if re.search(pattern, z)]
    return zinnen
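One way to work on the TODO above is to run the pattern against a handful of spellings and see which ones it accepts. The candidate strings below are made up for illustration; they are not taken from the data.

```python
import re

pattern = r"[Ii]mmun[eo].?therap"

# Hypothetical test strings covering a few spelling variants.
candidates = [
    "immunotherapy",          # no separator
    "Immunotherapie",         # Dutch spelling, capitalised
    "immune-therapy",         # hyphen
    "immune therapy",         # single space
    "Immunetherapy",          # the spelling seen in the news data
    "immune  therapy",        # two spaces: .? only covers one character
    "IMMUNOTHERAPY",          # all caps: only the first letter may be upper case
    "immunological therapy",  # different word in between
]

results = {s: bool(re.search(pattern, s)) for s in candidates}

for s, hit in results.items():
    print(f"{s!r:28} {'match' if hit else 'no match'}")
```

The double-space case shows why the function collapses whitespace before matching. Whether all-caps text should match is a design decision; passing `re.IGNORECASE` would be one option.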


In [92]:
df['zinnen'] = df['text'].apply(extract_zinnen)

In [111]:
df

Unnamed: 0,text,naam,zinnen
0,rjhwrk\nwrlhk. ekleh .. Deze zin bevat Immuno...,artikelen/art2.txt,[ekleh .. Deze zin bevat Immunotherapie!]
1,ekhpworeh\nwhkw\n4jklrejk\nrejkl\nR5J\n,artikelen/art3.txt,[]
2,ej gtiweohwhkl. Een zin met Immunotherapie. En...,artikelen/art1.txt,"[Een zin met Immunotherapie., En nog een Immun..."


In [112]:
# we want one row per sentence, not one row per article
# later we could undo this again, for example with .groupby("naam")
output = df.explode('zinnen')
output
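A small sketch of what `explode` does and how a `groupby("naam")` could reverse it, using a made-up dataframe with the same column names. Note that an empty list becomes a single row with `NaN`, so the reverse step has to treat those rows separately; the approach below is one possible way, not necessarily the only one.

```python
import pandas as pd

# Hypothetical miniature of the dataframe above.
df = pd.DataFrame({
    "naam": ["a.txt", "b.txt", "c.txt"],
    "zinnen": [["s1"], [], ["s2", "s3"]],
})

# One row per sentence; the empty list for b.txt becomes one NaN row.
long = df.explode("zinnen")

# Reverse: drop the NaN rows, collect sentences per file into lists,
# then restore files that had no sentences as empty lists.
restored = (
    long.dropna(subset=["zinnen"])
        .groupby("naam")["zinnen"]
        .agg(list)
        .reindex(df["naam"])
        .apply(lambda v: v if isinstance(v, list) else [])
)

print(long)
print(restored)
```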

In [103]:
output.drop("text", axis=1).to_excel("output.xlsx")

# Below: some playing around with the Excel data and other function definitions, so no longer really up to date

Still interesting: with `df.iloc[row, column]` you can grab a specific cell, so you can do things like `extract_zin(df.iloc[0,2])` to test your function.
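As a minimal illustration of that idea, the sketch below picks one cell by position and passes it to a function. Both the dataframe and the `count_words` helper are hypothetical stand-ins.

```python
import pandas as pd

def count_words(x):
    # Hypothetical helper, standing in for the function you want to test.
    return len(x.split())

df_demo = pd.DataFrame({
    "naam": ["a.txt", "b.txt"],
    "text": ["one two three", "four five"],
})

# iloc is purely positional: row 0, column 1 (the "text" column).
cell = df_demo.iloc[0, 1]
result = count_words(cell)
print(cell, result)
```

Testing on a single cell first makes it easier to spot mistakes before applying the function to the whole column.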

In [113]:
df.iloc?

In [2]:
df1 = pd.read_excel("Alle_4599_chunks_EN_news.xlsx", header=None)
df2 = pd.read_excel("Alle_4761_chunks_EN_aca.xlsx", header=None)

In [4]:
df2

Unnamed: 0,0,1,2
0,1,be cognizant of these manifestations and be ...,Somarouthu-2018-Immune-related tumour response...
1,2,burden compared to pre-baseline levels and >...,Somarouthu-2018-Immune-related tumour response...
2,1,"experiments,® in which unblocking therapy wa...",Bansal-1978-Multimodal immunotherapy of primar...
3,2,penetrating radi- ation absorbed in a target...,Fisher-1994-Radiation dosimetry for radioimmun...
4,3,.’° These recently developed chemotherapy reg...,Milowsky-2002-Active chemotherapy for collecti...
...,...,...,...
4756,1,various stages of the disease. BCG has becom...,Nair-2020-The Tumor Microenvironment and Immun...
4757,2,of evidence is gathering concerning the role...,Nussenblatt-2007-Age-related macular degenerat...
4758,3,are currently available. Three other phase 3...,Patel-2020-Treatment of muscle-invasive and a.txt
4759,4,those with BMs. Intracranial response rates i...,Moravan-2020-Current multidisciplinary managem...


In [7]:
test = df1.iloc[0,1]
test

"  can be provided at half the cost of competing therapies.  to  modify  patients'  anxiety-reinforcing   therapy   (CBT)  aims   Acupuncture is viewed sceptically by many pain professionals but is popular with patients.Areview of clinical trial  evidence by the highly respected Cochrane Library concluded that it has a small effect in chronic low back pain.  Immunetherapy Scientists are now trying to tackle complex regional pain syndrome by using a treatment designed  for immune diseases such as rheumatoid arthritis. One small dose of intravenous immunoglobulin reduced pain for  fiveweeks  in  just  under  half  of  patients  treated,  a  Liverpool  University  reported  last  month  in  the  Annals  of  InternalMedicine.  John Naish  If the Is pain a symptom or a disease_;New research shows that chronic pain may be all in the mind. John NA.txt"

In [74]:
def extract_zin(x):
    pattern = r"[Ii]mmun[eo].?therap"   # TODO: check whether the regex is correct
    # keep only the sentences of x that contain the pattern
    zinnen = [z for z in nltk.sent_tokenize(x, language='english') if re.search(pattern, z)]
    return zinnen


In [56]:
pattern = r"[Ii]mmun[eo].?therap"
# re.match only looks at the very start of the string; re.search finds the pattern anywhere
matches = re.search(pattern, test)

In [60]:
for e in re.finditer(pattern, test):
    print(e.span())
    print()

(360, 372)

(901, 913)



In [68]:
e.end()

913

In [63]:
e.start()

901
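The three `re` functions behave differently: `re.match` only tries the start of the string, `re.search` returns the first hit anywhere, and `re.finditer` yields every hit as a match object with `.start()`, `.end()` and `.span()`. A short sketch on a made-up string:

```python
import re

pattern = r"[Ii]mmun[eo].?therap"
# Hypothetical text; the pattern does not occur at position 0.
text = "  Costs are low. Immunetherapy is new. Later, immuno-therapies followed."

at_start = re.match(pattern, text)    # None: the text does not begin with the pattern
first = re.search(pattern, text)      # first occurrence anywhere in the text
spans = [m.span() for m in re.finditer(pattern, text)]  # (start, end) of every occurrence

# The spans can be used to slice the matched text (or some context around it).
matched = [text[s:e] for s, e in spans]

print(at_start)
print(first.group())
print(spans)
print(matched)
```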

## Quick check whether .split() and sent_tokenize differ...

In [23]:
nltk.sent_tokenize(df1.iloc[400,1], language='english')

['  are funding  the development of these medicines - many of which will fail before they reach patients.',
 'As we explore areas of  science that are increasingly difficult and risky, failure rates are increasing.',
 'Patients across the world who benefit  from the few medicines that do succeed (those offered personalised cancer care, curative hepatitis C medicines,  immunotherapy and malaria prevention) would surely disagree that the drugs offer little therapeutic advance.',
 'Furthermore, companies cannot set whatever price they like for a medicine, despite the rhetoric.',
 'Medicines in the  UK are subject to multiple levels of regulation, including by the National Institute for Health and Care Excellence  and  the  pharmaceutical  pricing  regulation  scheme  (']

In [24]:
df1.iloc[400,1].split('. ')

['  are funding  the development of these medicines - many of which will fail before they reach patients',
 'As we explore areas of  science that are increasingly difficult and risky, failure rates are increasing',
 'Patients across the world who benefit  from the few medicines that do succeed (those offered personalised cancer care, curative hepatitis C medicines,  immunotherapy and malaria prevention) would surely disagree that the drugs offer little therapeutic advance',
 ' Furthermore, companies cannot set whatever price they like for a medicine, despite the rhetoric',
 'Medicines in the  UK are subject to multiple levels of regulation, including by the National Institute for Health and Care Excellence  and  the  pharmaceutical  pricing  regulation  scheme  (']

... so only a little bit
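The differences are small on this chunk, but `split('. ')` has some systematic weaknesses: it drops the full stop, leaves stray leading spaces, and ignores `!` and `?`. The sketch below illustrates this on a made-up string. To stay dependency-free it uses a naive regex splitter as a stand-in for a sentence tokenizer; this is not how `nltk.sent_tokenize` works internally.

```python
import re

# Hypothetical text with a double space and different sentence endings.
text = "First sentence.  Second one! Third one? Last."

# Plain split: only '. ' counts as a boundary, and the full stop is removed.
by_split = text.split(". ")

# Naive stand-in for a tokenizer: split on whitespace that follows . ! or ?
by_regex = re.split(r"(?<=[.!?])\s+", text)

print(by_split)
print(by_regex)
```

On real text, a trained tokenizer also has to deal with abbreviations and initials, which neither of these two approaches attempts.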

In [16]:
df1[3]= df1[1].apply(extract_zin)

In [17]:
df1

Unnamed: 0,0,1,3
0,1,can be provided at half the cost of competin...,Immunetherapy Scientists are now trying to ta...
1,1,Ltd All Rights Reserved Section: NEWS; Pg. ...,The immune-therapy is based on a biolog...
2,1,"costing from œ5,000. Add-ons - at a three-f...",Then there are immune therapies and blood tr...
3,1,Cancer defeated in a pilot program The Sun (...,14 Length: 125 words Byline: MARILYNN MARCHI...
4,1,-up by the company that there was bound to b...,Hopes for immuno-therapies are now so hyped-u...
...,...,...,...
4594,4780,forefront of your mind. I've even been able...,"Any more, and the only viable options are tre..."
4595,4781,"hopeful of raising more funds. ""It is imper...","""I'm so sorry we couldn't get to the stage of..."
4596,4782,"cancer can affect anyone, of any age with mo...","""I'm so sorry we couldn't get to the stage of..."
4597,4783,best possible chance offered by cutting-edge...,A final set of scans will show whether the im...
