# XQuery Analysis

## Conceptual Model
![Conceptual Model](../img/conceitual-v4-xpath.png "Conceptual Model")

## Logical Model

- **Disease**(*name*, occurrences)
- **Symptom**(*name*, occurrences)
- **Cause**(*disease*, *symptom*, occurrences, score)
    - Foreign key: *disease* -> **Disease**
    - Foreign key: *symptom* -> **Symptom**
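
The logical model above can be sketched as a relational schema. The snippet below is only an illustration using Python's built-in `sqlite3` module: the column types are assumptions (the model names attributes and keys, not types), and the composite primary key on **Cause** is read from the two italicized attributes.

```python
import sqlite3

# Assumed column types; the logical model only names attributes and keys.
SCHEMA = """
CREATE TABLE Disease (
    name        TEXT PRIMARY KEY,
    occurrences INTEGER
);
CREATE TABLE Symptom (
    name        TEXT PRIMARY KEY,
    occurrences INTEGER
);
CREATE TABLE Cause (
    disease     TEXT REFERENCES Disease(name),
    symptom     TEXT REFERENCES Symptom(name),
    occurrences INTEGER,
    score       REAL,
    PRIMARY KEY (disease, symptom)
);
"""

def create_schema(conn):
    """Create the three tables and turn on foreign-key enforcement."""
    # SQLite does not enforce foreign keys unless this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)

conn = sqlite3.connect(":memory:")
create_schema(conn)
```

With enforcement enabled, inserting a **Cause** row whose disease or symptom does not exist raises `sqlite3.IntegrityError`.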


## Converting the CSV data to XML
The amount of data was reduced considerably so that the analyses could run in Zorba: only about 100 diseases (the script below keeps the first 99 data rows of the disease CSV) and the 5 highest-scoring symptoms per disease were used.

```python
from xml.sax.saxutils import quoteattr

# build a dictionary of symptoms: name -> occurrences
symptoms_dic = {}
with open('ncomms-symptom.csv', 'r') as fin:
    symptoms = fin.read().split('\n')[1:]  # skip the header row

    for symptom in symptoms:
        if symptom:
            name, occurrences = symptom.split('|')
            symptoms_dic[name] = int(occurrences)

# build a dictionary of causes: disease -> list of symptom records
causes_dic = {}
with open('ncomms-cause.csv', 'r') as fin:
    causes = fin.read().split('\n')[1:]  # skip the header row

    for cause in causes:
        if cause:
            symptom, disease, occurrences, score = cause.split('|')

            if disease not in causes_dic:
                causes_dic[disease] = []

            causes_dic[disease].append({
                'symptom': symptom,
                'occurrences': int(occurrences),
                'score': float(score)
            })

# keep only the 5 highest-scoring symptoms per disease
for disease in causes_dic:
    causes_dic[disease].sort(key=lambda k: k['score'], reverse=True)
    causes_dic[disease] = causes_dic[disease][:5]

xml = '<?xml version="1.0" encoding="UTF-8"?>\n'
xml += '<diseases>\n'

with open('ncomms-disease.csv', 'r') as fin:
    # skip the header row; the slice keeps lines 1..99, so at most 99 diseases
    diseases = fin.read().split('\n')[1:100]

    for disease in diseases:
        if disease:
            name, occurrences = disease.split('|')

            # quoteattr adds the quotes and escapes &, < and " in names
            xml += '\t<disease name=%s>\n' % quoteattr(name)
            xml += '\t\t<occurrences>%s</occurrences>\n' % occurrences

            # add the symptoms
            xml += '\t\t<symptoms>\n'
            if name in causes_dic:
                for symptom in causes_dic[name]:
                    xml += '\t\t\t<symptom name=%s>\n' % quoteattr(symptom['symptom'])
                    xml += '\t\t\t\t<occurrences>%d</occurrences>\n' % symptom['occurrences']
                    xml += '\t\t\t\t<score>%f</score>\n' % symptom['score']
                    xml += '\t\t\t</symptom>\n'

            xml += '\t\t</symptoms>\n'

            xml += '\t</disease>\n'

xml += '</diseases>\n'

# save the XML file using the encoding declared in its header
with open('ncomms.xml', 'w', encoding='utf-8') as fout:
    fout.write(xml)
```
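
Before running queries it can help to confirm that the generated file is well formed and respects the limits described above. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`; the `summarize` helper and the inline `SAMPLE` document are illustrative assumptions, with `SAMPLE` written by hand in the same layout as `ncomms.xml`.

```python
import xml.etree.ElementTree as ET

def summarize(xml_text):
    """Return (number of diseases, largest symptom count of any disease)."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well formed
    diseases = root.findall('disease')
    max_symptoms = max(
        (len(d.findall('symptoms/symptom')) for d in diseases), default=0
    )
    return len(diseases), max_symptoms

# Hand-written sample in the same layout as ncomms.xml
SAMPLE = """<diseases>
  <disease name="Diabetes Mellitus">
    <occurrences>38441</occurrences>
    <symptoms>
      <symptom name="Obesity">
        <occurrences>3255</occurrences>
        <score>2709.962136</score>
      </symptom>
    </symptoms>
  </disease>
</diseases>"""

print(summarize(SAMPLE))  # → (1, 1)
```

For the real file, `summarize(open('ncomms.xml', 'rb').read())` should report at most 5 symptoms for any disease.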

### Return the diseases labeled as Diabetes

```xquery
let $ncomms := doc('mydoc.xml')

for $i in $ncomms//disease
where matches($i/@name, 'Diabetes')
order by xs:integer($i/occurrences) descending
return $i
```

Result:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<disease name="Diabetes Mellitus, Type 2">
    <occurrences>46242</occurrences>
    <symptoms>
        <symptom name="Obesity">
            <occurrences>4638</occurrences>
            <score>3861.383836</score>
        </symptom>
        <symptom name="Albuminuria">
            <occurrences>1854</occurrences>
            <score>3828.670677</score>
        </symptom>
        <symptom name="Weight Loss">
            <occurrences>671</occurrences>
            <score>671.036040</score>
        </symptom>
        <symptom name="Proteinuria">
            <occurrences>465</occurrences>
            <score>580.774665</score>
        </symptom>
        <symptom name="Weight Gain">
            <occurrences>390</occurrences>
            <score>579.808999</score>
        </symptom>
    </symptoms>
</disease>
<disease name="Diabetes Mellitus">
    <occurrences>38441</occurrences>
    <symptoms>
        <symptom name="Obesity">
            <occurrences>3255</occurrences>
            <score>2709.962136</score>
        </symptom>
        <symptom name="Albuminuria">
            <occurrences>327</occurrences>
            <score>675.283340</score>
        </symptom>
        <symptom name="Body Weight">
            <occurrences>1100</occurrences>
            <score>377.286632</score>
        </symptom>
        <symptom name="Weight Loss">
            <occurrences>217</occurrences>
            <score>217.011655</score>
        </symptom>
        <symptom name="Proteinuria">
            <occurrences>170</occurrences>
            <score>212.326222</score>
        </symptom>
    </symptoms>
</disease>
<disease name="Diabetes Mellitus, Type 1">
    <occurrences>37268</occurrences>
    <symptoms>
        <symptom name="Albuminuria">
            <occurrences>1623</occurrences>
            <score>3351.635658</score>
        </symptom>
        <symptom name="Proteinuria">
            <occurrences>347</occurrences>
            <score>433.395287</score>
        </symptom>
        <symptom name="Obesity">
            <occurrences>477</occurrences>
            <score>397.128092</score>
        </symptom>
        <symptom name="Birth Weight">
            <occurrences>206</occurrences>
            <score>289.868974</score>
        </symptom>
        <symptom name="Fetal Macrosomia">
            <occurrences>86</occurrences>
            <score>286.957758</score>
        </symptom>
    </symptoms>
</disease>
<disease name="Diabetes Mellitus, Experimental">
    <occurrences>19998</occurrences>
    <symptoms>
        <symptom name="Body Weight">
            <occurrences>2268</occurrences>
            <score>777.896437</score>
        </symptom>
        <symptom name="Albuminuria">
            <occurrences>255</occurrences>
            <score>526.597100</score>
        </symptom>
        <symptom name="Obesity">
            <occurrences>321</occurrences>
            <score>267.249722</score>
        </symptom>
        <symptom name="Hyperalgesia">
            <occurrences>90</occurrences>
            <score>215.469907</score>
        </symptom>
        <symptom name="Proteinuria">
            <occurrences>161</occurrences>
            <score>201.085422</score>
        </symptom>
    </symptoms>
</disease>
```
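
For readers less familiar with XQuery, the same filter-and-sort logic can be sketched in Python. This is an illustrative equivalent, not the code used for the analysis: `diseases_matching` and `SAMPLE` are hypothetical names, the two Diabetes counts come from the result above, and the third entry is a made-up non-matching row.

```python
import re
import xml.etree.ElementTree as ET

def diseases_matching(root, pattern):
    """Return <disease> elements whose name matches `pattern`, most frequent first."""
    hits = [d for d in root.iter('disease')
            if re.search(pattern, d.get('name', ''))]
    # Compare occurrences as integers; comparing them as text would misorder
    # values with different digit counts (e.g. "9" vs "10000").
    return sorted(hits, key=lambda d: int(d.findtext('occurrences')), reverse=True)

SAMPLE = """<diseases>
  <disease name="Diabetes Mellitus, Type 1"><occurrences>37268</occurrences></disease>
  <disease name="Example Disease"><occurrences>1</occurrences></disease>
  <disease name="Diabetes Mellitus, Type 2"><occurrences>46242</occurrences></disease>
</diseases>"""

names = [d.get('name') for d in diseases_matching(ET.fromstring(SAMPLE), 'Diabetes')]
print(names)  # → ['Diabetes Mellitus, Type 2', 'Diabetes Mellitus, Type 1']
```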

### Return each disease with the symptoms whose score is above that disease's average score

```xquery
let $ncomms := doc('mydoc.xml')

for $i in ($ncomms//disease)
return
<disease>
  <name>{data($i/@name)}</name>
  <symptoms>
    <avgscore>{avg($i/symptoms/symptom/score)}</avgscore>
    {
      for $j in ($i/symptoms/symptom)
      where $j[score>avg($i/symptoms/symptom/score)]
      return $j
    }
  </symptoms>
</disease>
```

Result (limited to the first two diseases):
```xml
<?xml version="1.0" encoding="UTF-8"?>
<disease>
    <name>Breast Neoplasms</name>
    <symptoms>
        <avgscore>396.186838</avgscore>
        <symptom name="Hot Flashes">
            <occurrences>191</occurrences>
            <score>650.490799</score>
        </symptom>
    </symptoms>
</disease>
<disease>
    <name>Hypertension</name>
    <symptoms>
        <avgscore>1620.9151506</avgscore>
        <symptom name="Albuminuria">
            <occurrences>1316</occurrences>
            <score>2717.654051</score>
        </symptom>
        <symptom name="Obesity">
            <occurrences>2997</occurrences>
            <score>2495.163294</score>
        </symptom>
    </symptoms>
</disease>
```
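
The per-disease average is computed over that disease's own symptoms, so each disease has its own threshold. A small Python sketch of the same computation follows; the helper name is an assumption, and the sample reuses the five symptom scores listed for "Diabetes Mellitus, Type 1" in the first query result.

```python
import xml.etree.ElementTree as ET
from statistics import mean

def above_average_symptoms(disease):
    """Return (average score, names of symptoms scoring above that average)."""
    scores = {s.get('name'): float(s.findtext('score'))
              for s in disease.iter('symptom')}
    average = mean(list(scores.values()))
    return average, [name for name, score in scores.items() if score > average]

SAMPLE = """<disease name="Diabetes Mellitus, Type 1">
  <symptoms>
    <symptom name="Albuminuria"><score>3351.635658</score></symptom>
    <symptom name="Proteinuria"><score>433.395287</score></symptom>
    <symptom name="Obesity"><score>397.128092</score></symptom>
    <symptom name="Birth Weight"><score>289.868974</score></symptom>
    <symptom name="Fetal Macrosomia"><score>286.957758</score></symptom>
  </symptoms>
</disease>"""

average, names = above_average_symptoms(ET.fromstring(SAMPLE))
print(names)  # → ['Albuminuria']
```

One very high score pulls the average up, which is why a single symptom can be the only one above it.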

### Return the diseases that have the symptom 'Obesity'

```xquery
let $ncomms := doc('mydoc.xml')

for $i in ($ncomms//disease)
where ($i[symptoms/symptom/@name="Obesity"])
return <disease>{data($i/@name)}</disease>
```

Result:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<disease>Breast Neoplasms</disease>
<disease>Hypertension</disease>
<disease>Coronary Artery Disease</disease>
<disease>Coronary Disease</disease>
<disease>Prostatic Neoplasms</disease>
<disease>Obesity</disease>
<disease>Diabetes Mellitus, Type 2</disease>
<disease>Kidney Failure, Chronic</disease>
<disease>Diabetes Mellitus</disease>
<disease>Diabetes Mellitus, Type 1</disease>
<disease>Kidney Diseases</disease>
<disease>Myocardial Ischemia</disease>
<disease>Colorectal Neoplasms</disease>
<disease>Carcinoma, Hepatocellular</disease>
<disease>Liver Diseases</disease>
<disease>Arteriosclerosis</disease>
<disease>Liver Cirrhosis</disease>
<disease>Inflammation</disease>
<disease>Pregnancy Complications</disease>
<disease>Diabetes Mellitus, Experimental</disease>
<disease>Uterine Neoplasms</disease>
<disease>Esophageal Neoplasms</disease>
<disease>Psoriasis</disease>
```

### Return the average 'occurrences' across all diseases

```xquery
let $ncomms := doc('mydoc.xml')

return avg($ncomms//disease/occurrences)
```

Result:
```xml
<?xml version="1.0" encoding="UTF-8"?>
33340.515151515152
```

### Return the number of diseases whose 'occurrences' is above the average

```xquery
let $ncomms := doc('mydoc.xml')

return count($ncomms//disease[occurrences > avg($ncomms//disease/occurrences)])
```

Result:
```xml
<?xml version="1.0" encoding="UTF-8"?>
32
```
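
The last two queries reduce to a mean and a filtered count. As a worked example on a small subset, the sketch below applies the same arithmetic to the four Diabetes `occurrences` values shown earlier; the helper name is an assumption, and the numbers differ from the full-file results above because only four diseases are included.

```python
from statistics import mean

def count_above_average(occurrences):
    """Return (average, how many values are strictly above the average)."""
    average = mean(occurrences)
    return average, sum(1 for n in occurrences if n > average)

# occurrences of the four Diabetes entries from the first query result
sample = [46242, 38441, 37268, 19998]

print(count_above_average(sample))  # → (35487.25, 3)
```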

# SPARQL Analysis

We use the MeSH data, queried through its SPARQL endpoint.

```
%endpoint http://id.nlm.nih.gov/mesh/sparql
%display table
%show all
```
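
Outside a notebook, the same endpoint can be queried over HTTP. The sketch below follows the standard SPARQL protocol (a `query` parameter on a GET request plus an `Accept` header asking for JSON results); whether the MeSH endpoint honors this exact header is an assumption, and `build_request` and `run_query` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://id.nlm.nih.gov/mesh/sparql"  # same endpoint as above

def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )

def run_query(query):
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(query)) as response:
        return json.load(response)["results"]["bindings"]

request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)
```

Only `build_request` runs here; `run_query` is left uncalled so the example works offline.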

### Return all digestive system diseases

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX mesh2015: <http://id.nlm.nih.gov/mesh/2015/>
PREFIX mesh2016: <http://id.nlm.nih.gov/mesh/2016/>
PREFIX mesh2017: <http://id.nlm.nih.gov/mesh/2017/>

SELECT DISTINCT ?descriptor ?label
WHERE {
    mesh:D004066 meshv:treeNumber ?treeNum .
    ?childTreeNum meshv:parentTreeNumber+ ?treeNum .
    ?descriptor meshv:treeNumber ?childTreeNum .
    ?descriptor rdfs:label ?label .
}
ORDER BY ?label
```

Result (first 10 rows):

| descriptor | label |
| --- | --- |
| http://id.nlm.nih.gov/mesh/D042101 | Acalculous Cholecystitis |
| http://id.nlm.nih.gov/mesh/D000126 | Achlorhydria |
| http://id.nlm.nih.gov/mesh/D065290 | Acute-On-Chronic Liver Failure |
| http://id.nlm.nih.gov/mesh/D007516 | Adenoma, Islet Cell |
| http://id.nlm.nih.gov/mesh/D018248 | Adenoma, Liver Cell |
| http://id.nlm.nih.gov/mesh/D011125 | Adenomatous Polyposis Coli |
| http://id.nlm.nih.gov/mesh/D000343 | Afferent Loop Syndrome |
| http://id.nlm.nih.gov/mesh/D016738 | Alagille Syndrome |
| http://id.nlm.nih.gov/mesh/D000694 | Anal Gland Neoplasms |
| http://id.nlm.nih.gov/mesh/D017129 | Anisakiasis |
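
The `meshv:parentTreeNumber+` path in the query follows parent links one or more times, so it reaches every tree number below the starting one. Because tree numbers are dotted paths, the same "strictly below" test can be pictured as a prefix check. The sketch below illustrates the idea only; the tree numbers in it are made-up placeholders, not real MeSH values.

```python
def is_descendant(tree_number, ancestor):
    """True if `tree_number` lies strictly below `ancestor` in the dotted hierarchy."""
    # Appending the dot avoids false matches such as "X010" under "X01".
    return tree_number.startswith(ancestor + ".")

def descendants(tree_numbers, ancestor):
    """Return all tree numbers below `ancestor`, at any depth."""
    return sorted(t for t in tree_numbers if is_descendant(t, ancestor))

# Placeholder tree numbers, for illustration only
numbers = ["X01", "X01.100", "X01.100.200", "X010.5", "X02.300"]

print(descendants(numbers, "X01"))  # → ['X01.100', 'X01.100.200']
```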


### Diseases highly similar to 'Hepatitis B' at all levels

```cypher
MATCH (d:Disease {name: "Hepatitis B"})-[:VERY_SIMILAR_TO*]-(a:Disease)
RETURN d, a
LIMIT 50
```

![Hepatitis B](../img/hepatitis_b.png "Title")

### Which digestive system diseases are bacterial infections

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX mesh2015: <http://id.nlm.nih.gov/mesh/2015/>
PREFIX mesh2016: <http://id.nlm.nih.gov/mesh/2016/>
PREFIX mesh2017: <http://id.nlm.nih.gov/mesh/2017/>

SELECT DISTINCT ?descriptor ?label
WHERE {
    mesh:D004066 meshv:treeNumber ?treeNum .
    ?childTreeNum meshv:parentTreeNumber+ ?treeNum .
    ?descriptor meshv:broaderDescriptor+ mesh:D001424 .
    ?descriptor meshv:treeNumber ?childTreeNum .
    ?descriptor rdfs:label ?label .
}
```

Result:

| descriptor | label |
| --- | --- |
| http://id.nlm.nih.gov/mesh/D004405 | Dysentery, Bacillary |
| http://id.nlm.nih.gov/mesh/D004761 | Enterocolitis, Pseudomembranous |
| http://id.nlm.nih.gov/mesh/D008061 | Whipple Disease |
| http://id.nlm.nih.gov/mesh/D014385 | Tuberculosis, Gastrointestinal |
| http://id.nlm.nih.gov/mesh/D014386 | Tuberculosis, Hepatic |
| http://id.nlm.nih.gov/mesh/D014395 | Peritonitis, Tuberculous |


### Relationships between symptoms of bacterial infections of the digestive system

```cypher
MATCH (a:Disease {name: "Dysentery, Bacillary"})-[:CAUSES]->(s:Symptom)
MATCH (b:Disease {name: "Tuberculosis, Gastrointestinal"})-[:CAUSES]->(s)
MATCH (c:Disease {name: "Enterocolitis, Pseudomembranous"})-[:CAUSES]->(s)
MATCH (d:Disease {name: "Whipple Disease"})-[:CAUSES]->(s)
RETURN a, b, c, d, s
```

![Bacterial Infections](../img/infeccao_bacteriana_digestivo.png "Title")
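
Because every `MATCH` clause in the query above binds the same symptom node `s`, it only returns symptoms caused by all four diseases. In plain Python this is a set intersection, as the sketch below shows. The `causes` mapping and its symptom names are hypothetical sample data, not values taken from the graph.

```python
def shared_symptoms(causes, diseases):
    """Return the symptoms caused by every disease in `diseases`."""
    symptom_sets = [set(causes.get(d, ())) for d in diseases]
    return set.intersection(*symptom_sets) if symptom_sets else set()

# Hypothetical disease -> symptoms mapping, for illustration only
causes = {
    "Dysentery, Bacillary": {"Diarrhea", "Fever", "Abdominal Pain"},
    "Tuberculosis, Gastrointestinal": {"Diarrhea", "Weight Loss", "Abdominal Pain"},
    "Enterocolitis, Pseudomembranous": {"Diarrhea", "Fever"},
    "Whipple Disease": {"Diarrhea", "Weight Loss"},
}

print(sorted(shared_symptoms(causes, list(causes))))  # → ['Diarrhea']
```

A disease with no recorded symptoms makes the intersection empty, which mirrors the Cypher query returning no rows when any one `MATCH` fails.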

### Which digestive system diseases are intestinal diseases

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX mesh2015: <http://id.nlm.nih.gov/mesh/2015/>
PREFIX mesh2016: <http://id.nlm.nih.gov/mesh/2016/>
PREFIX mesh2017: <http://id.nlm.nih.gov/mesh/2017/>

SELECT DISTINCT ?descriptor ?label
WHERE {
    mesh:D004066 meshv:treeNumber ?treeNum .
    ?childTreeNum meshv:parentTreeNumber+ ?treeNum .
    ?descriptor meshv:broaderDescriptor mesh:D007410 .
    ?descriptor meshv:treeNumber ?childTreeNum .
    ?descriptor rdfs:label ?label .
}
```

Result:

| descriptor | label |
| --- | --- |
| http://id.nlm.nih.gov/mesh/D002429 | Cecal Diseases |
| http://id.nlm.nih.gov/mesh/D003108 | Colonic Diseases |
| http://id.nlm.nih.gov/mesh/D004378 | Duodenal Diseases |
| http://id.nlm.nih.gov/mesh/D004403 | Dysentery |
| http://id.nlm.nih.gov/mesh/D004760 | Enterocolitis |
| http://id.nlm.nih.gov/mesh/D015212 | Inflammatory Bowel Diseases |
| http://id.nlm.nih.gov/mesh/D011504 | Protein-Losing Enteropathies |
| http://id.nlm.nih.gov/mesh/D012002 | Rectal Diseases |
| http://id.nlm.nih.gov/mesh/D044483 | Intestinal Polyposis |
| http://id.nlm.nih.gov/mesh/D007077 | Ileal Diseases |
