# Let's Talk About Women's Health: Finding Endometriosis in Narrative Text
<strong>Name:</strong> Amber Kiser <br/>
<strong>Email:</strong> amber.kiser@utah.edu <br/>
<strong>UID:</strong> u0883495 <br/>

## Background
### Endometriosis
<img src='img/oneinten.jpg' width=300 height=300 align='left'/>
   
<div style='overflow:hidden; padding-left: 10px; padding-top: 50px;'>
<ul style="list-style-type:square;">
    <li>Endometriosis is a common yet debilitating disease, affecting about <strong>10%</strong> of reproductive-age women.</li>
    <li>Symptoms include <strong>intense pelvic pain</strong> and <strong>devastating infertility</strong>.</li>
    <li>A reliable diagnosis depends on <strong>surgery</strong>.</li>
    <li>There is a significant diagnostic delay of <strong>9-12 years</strong>, leading to many <strong>negative outcomes</strong> such as depression, anxiety, and difficulty with intimate relationships.</li>
</ul>
<h3 style='padding-left: 15px; padding-top: 30px;'>Research Question: Can we classify endometriosis from narrative text?</h3>
</div>




## Ethical Considerations
The topic of this project is sensitive and private. However, from the beginning of the project I chose to pull data from Reddit, an open, publicly available forum. Text posted in open subreddits is generally understood to be available to all, which mitigates privacy concerns. In addition, I do not collect usernames, which de-identifies the dataset.

### Stakeholders include:
<dl>
    <dt>Redditors</dt>
        <dd>These are the people who post on Reddit. The incentive to post on Reddit is peer feedback. People can post questions or descriptions and get feedback from others in the same or similar situations. Many people ask on Reddit whether their symptoms align with endometriosis, the same question my classifier will work to answer.</dd>
    <dt>Patients</dt>
        <dd>These people may or may not post on Reddit. They experience symptoms, possibly related to endometriosis, and seek out treatment. Like Redditors, they want to understand the cause of their symptoms in order to get effective treatment. My classifier should aid in this process.</dd>
    <dt>Healthcare Providers</dt>
        <dd>These people provide diagnosis and treatment for patients. They want to improve their patients' quality of life. My classifier should aid them in their diagnosis.</dd>
</dl>


## Reddit Data
The data consists of submissions and comments from these subreddits:
<ul style="list-style-type:square;">
    <li><a href='https://www.reddit.com/r/endometriosis/'>/r/endometriosis</a></li>
    <li><a href='https://www.reddit.com/r/Endo/'>/r/Endo</a></li>
    <li><a href='https://www.reddit.com/r/PCOS/'>/r/PCOS</a></li>
</ul>

The posts were separated into two categories: endometriosis-related posts (from /r/endometriosis and /r/Endo) and non-endometriosis-related posts (from /r/PCOS).

To pull the Reddit data, I used the Python package <em><a href='https://github.com/dmarx/psaw'>PSAW</a></em>, a wrapper for the <a href='https://github.com/pushshift/api'>Pushshift API</a>.
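The snippet below is a minimal sketch of how such a pull might look, not the exact code used in this project. The helper `collect_posts`, the `SUBREDDIT_LABELS` mapping, and the requested fields are assumptions for illustration. `search_submissions`, `search_comments`, and the `d_` attribute come from PSAW. Only the text and subreddit are kept, so no usernames enter the dataset.

```python
from typing import Any, Dict, List

# Assumed label mapping: the two endometriosis subreddits form the positive
# class (1) and /r/PCOS forms the negative class (0).
SUBREDDIT_LABELS = {"endometriosis": 1, "Endo": 1, "PCOS": 0}


def collect_posts(api: Any, limit_per_subreddit: int = 100) -> List[Dict[str, Any]]:
    """Collect submission and comment text per subreddit, keeping no usernames.

    `api` is expected to behave like psaw.PushshiftAPI.
    """
    rows = []
    for subreddit, label in SUBREDDIT_LABELS.items():
        submissions = api.search_submissions(
            subreddit=subreddit,
            filter=["title", "selftext"],
            limit=limit_per_subreddit,
        )
        for item in submissions:
            data = item.d_  # PSAW exposes the returned fields as a dict in `d_`
            parts = (data.get("title", ""), data.get("selftext", ""))
            text = " ".join(part for part in parts if part)
            rows.append({"text": text, "subreddit": subreddit, "label": label})

        comments = api.search_comments(
            subreddit=subreddit,
            filter=["body"],
            limit=limit_per_subreddit,
        )
        for item in comments:
            text = item.d_.get("body", "")
            rows.append({"text": text, "subreddit": subreddit, "label": label})
    return rows


# Example usage (requires the psaw package and network access to Pushshift):
#   from psaw import PushshiftAPI
#   rows = collect_posts(PushshiftAPI(), limit_per_subreddit=500)
```

Passing the API object in as an argument keeps the function easy to test with a stand-in object and makes the de-identification step explicit: the `author` field is never copied into the output rows.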

## Methods
### Exploratory Analysis
<ol>
    <li>Clean the text, including:
        <ol style="list-style-type:lower-roman;">
            <li>Tokenize text into words.</li>
            <li>Make all words lowercase.</li>
            <li>Remove punctuation and stopwords.</li>
        </ol>
    </li>
    <li>Identify the most common words in each category.</li>
    <li>Analyze the length of posts in each category.</li>
    <li>Create word clouds for each category, visualizing the data.</li>
</ol>
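The cleaning and counting steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's actual code: the whitespace tokenizer, the tiny `STOPWORDS` set, and the function names are all hypothetical, and a fuller stopword list (for example from NLTK) could be swapped in.

```python
import string
from collections import Counter
from statistics import mean

# Small illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {
    "a", "an", "and", "the", "i", "my", "me", "is", "was", "to", "of",
    "it", "for", "in", "have", "so", "but",
}

# Translation table that deletes ASCII punctuation. Non-ASCII marks such as
# curly apostrophes are left in place.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Tokenize, lowercase, and remove punctuation and stopwords."""
    tokens = text.split()                                  # tokenize on whitespace
    tokens = [t.lower() for t in tokens]                   # lowercase
    tokens = [t.translate(_PUNCT_TABLE) for t in tokens]   # strip punctuation
    return [t for t in tokens if t and t not in STOPWORDS]


def most_common_words(posts, n=10):
    """Return the n most frequent cleaned tokens across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(clean_text(post))
    return counts.most_common(n)


def length_summary(posts):
    """Return the mean, minimum, and maximum post length in characters."""
    lengths = [len(post) for post in posts]
    return {"mean": mean(lengths), "min": min(lengths), "max": max(lengths)}
```

Running `most_common_words` and `length_summary` once per category gives the inputs for the bar charts, the length comparison, and the word clouds.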

### Model Development
<ol>
    <li>Split the data into a training set (70%) and testing set (30%).</li>
    <li>Tune the hyperparameters of the text vectorizers (count and TF-IDF) and classifiers (support vector machine, random forest, and neural network), using cross validation on the training data set.</li>
    <li>Train the vectorizers and classifiers, using the best parameters found from tuning.</li>
    <li>Evaluate and compare the final models, using accuracy, area under the receiver operating characteristic curve (AUC), recall, and precision.</li>
</ol>
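One way to wire these steps together is a scikit-learn `Pipeline` tuned with `GridSearchCV`. The sketch below shows a single vectorizer and classifier pair (TF-IDF with a random forest). The grid values, the three-fold cross validation, the stratified split, and the function name `tune_and_evaluate` are assumptions for illustration, not the settings used in this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def tune_and_evaluate(texts, labels, seed=0):
    """Split 70/30, tune on the training set, and score on the test set."""
    # Step 1: hold out 30% of the data for testing (stratification is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.30, stratify=labels, random_state=seed
    )

    # Step 2: tune vectorizer and classifier hyperparameters together, so the
    # vectorizer is refit inside each cross-validation fold.
    pipeline = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("classifier", RandomForestClassifier(random_state=seed)),
    ])
    param_grid = {  # hypothetical grid, kept small for illustration
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "classifier__n_estimators": [50, 100],
    }
    search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")

    # Step 3: fit() refits the best configuration on the full training set.
    search.fit(X_train, y_train)

    # Step 4: evaluate on the held-out test set.
    y_pred = search.predict(X_test)
    y_score = search.predict_proba(X_test)[:, 1]  # probability of the positive class
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }
    return search.best_params_, metrics
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or the random forest for another classifier, only changes the pipeline steps and the grid. Keeping the vectorizer inside the pipeline also prevents vocabulary from the validation folds leaking into tuning.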


## Results
There were a total of 76,668 posts retrieved, including 39,464 (51.5%) endometriosis-related posts and 37,204 (48.5%) non-endometriosis-related posts. 
<br/>
An example of an endometriosis-related post is:  
<blockquote>"Lost 2 jobs because of endo. The pain is one thing and the fatigue is another. It sucks. I loved my job too. Considering a work from home job to see if I can kickstart my career again."</blockquote>
<br/>
An example of a non-endometriosis-related post is:  
<blockquote>"Just been diagnosed aged 30 and TTC I’m new here. I just had an ultrasound after having fertility issues which showed I have PCOS. I’m not really sure how to feel about it but thought this would be a safe space to come.   My ovaries are 50% and 75% larger than average. Not had a period for more than 100 days and really struggling to wrap my head around what this means for us.   Just thought I’d reach out in case anyone is in the same situation ❤️"</blockquote>

### Exploratory Analysis
The most common words in the endometriosis-related posts included <em>pain</em> (by far the most common), <em>surgery</em>, and <em>feel</em>. The most common words in the non-endometriosis-related posts included <em>hair</em>, <em>weight</em>, and <em>period</em>.
<figure style='display: inline-block; padding: 30px;'>
    <img src='results/images/most_common_endo.png' height=400 width=600 align='left'/>
    <figcaption>Figure 1: Most common words seen in the endometriosis-related posts.</figcaption>
</figure>
<figure style='display: inline-block; padding: 30px;'>
    <img src='results/images/most_common_pcos.png' height=400 width=600 align='left'/>
    <figcaption>Figure 2: Most common words seen in the non-endometriosis-related posts.</figcaption>
</figure>

<br/>
<br/>
The average length of an endometriosis-related post was 639 characters, ranging from 1 to 17,853 characters, while the average length of a non-endometriosis-related post was 715 characters, ranging from 2 to 14,777 characters.  
<figure style='display: inline-block; padding: 30px;'>
    <img src='results/images/post_lengths.png' height=400 width=600 align='left'/>
    <figcaption>Figure 3: Lengths of posts in each category.</figcaption>
</figure>

<br/>
<br/>
The word clouds reflect the most common words from the bar charts above.
<figure style='display: inline-block; padding: 30px;'>
    <img src='results/images/endo_word_cloud.png' height=400 width=600 align='left'/>
    <figcaption>Figure 4: Word cloud representing the most common words seen in the endometriosis-related posts.</figcaption>
</figure>

<figure style='display: inline-block; padding: 30px;'>
    <img src='results/images/pcos_word_cloud.png' height=400 width=600 align='left'/>
    <figcaption style='display: inline-block;'>Figure 5: Word cloud representing the most common words seen in the non-endometriosis-related posts.</figcaption>
</figure>

### Model Development
<table>
    <tr style="background-color:rgba(22,22,22,0.2);">
        <th>Model</th>
        <th>Accuracy</th>
        <th>Recall</th>
        <th>Precision</th>
        <th>AUC</th>
    </tr>
    <tr style="background-color:rgba(22,22,22,0);">
        <td>Random Forest - Count</td>
        <td>0.859</td>
        <td>0.930</td>
        <td>0.819</td>
        <td>0.938</td>
    </tr>
    <tr style="background-color:rgba(22,22,22,0);">
        <td>Random Forest - TF-IDF</td>
        <td>0.858</td>
        <td>0.932</td>
        <td>0.818</td>
        <td>0.939</td>
    </tr>
    <tr style="background-color:rgba(22,22,22,0.1);">
        <td>SVM - Count</td>
        <td>0.848</td>
        <td>0.938</td>
        <td>0.801</td>
        <td>0.938</td>
    </tr>
    <tr style="background-color:rgba(22,22,22,0.1);">
        <td>SVM - TF-IDF</td>
        <td></td>
        <td></td>
        <td></td>
        <td></td>
    </tr>
    <tr style="background-color:rgba(22,22,22,0);">
        <td>Neural Network - Count</td>
        <td></td>
        <td></td>
        <td></td>
        <td></td>
    </tr>
    <tr style="background-color:rgba(22,22,22,0);">
        <td>Neural Network - TF-IDF</td>
        <td></td>
        <td></td>
        <td></td>
        <td></td>
    </tr>
</table>

<div style='color:red;'>
**** Put in AUC curve comparison chart.
**** Put in Feature importances chart.
</div>


## Discussion 

The exploratory analysis shows that the endometriosis posts are dominated by the word "pain," whereas the non-endometriosis posts have a smaller gap between their first and second most common words. The words seen are those expected: women with endometriosis suffer from debilitating pain and often have surgery, while women with PCOS, the non-endometriosis category, more often experience hair loss and weight gain. Post lengths were similar between categories (means of 639 and 715 characters), which reduces one potential source of bias.

Limitations of this study include a potential selection bias, as only people with internet access who also participate in social media are included. In addition, a diagnosis is not required to post in these subreddits, so some posts labeled by subreddit may come from people without the condition. Hardware was also a limitation when tuning and training the neural network: large graphs exhausted the memory of the compute clusters, so training could not finish.

Future work could include using noun phrases rather than just unigrams and bigrams. A more curated dataset, limited to posts in which the user confirms a diagnosis, could also be used.
