# Project Introduction

Congratulations on finishing the modules! This is the project notebook for the text mining track. In this project, you will select a dataset, build your own analysis using the skills you learned in the previous modules, and write up your findings. It is a chance to get your hands dirty with real healthcare data.

## Project Datasets

Below is a list of carefully curated datasets for you to use. Please read about your selected dataset and run the code cell associated with it.

However, if you prefer to use your own dataset, go ahead!

### Drug Review Dataset (Druglib.com) Data Set

The dataset provides patient reviews on specific drugs along with related conditions. Reviews are grouped into reports on three aspects: benefits, side effects, and overall comment. Ratings are also available for overall satisfaction, together with a 5-step side effect rating and a 5-step effectiveness rating. The data was obtained by crawling online pharmaceutical review sites. The intention was to study  
  
(1) sentiment analysis of drug experience over multiple facets, i.e. sentiments learned on specific aspects such as effectiveness and side effects,  
(2) the transferability of models among domains, i.e. conditions, and  
(3) the transferability of models among different data sources (see 'Drug Review Dataset (Drugs.com)').
  
The dataset includes a separate testing dataset (drugLibTest_raw.csv). Use drugLibTrain_raw.csv to build a prediction model, then measure performance metrics (e.g., accuracy, precision, and recall) on drugLibTest_raw.csv.

*Source: Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125.*  
  
Visit https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Druglib.com%29 for more information.

In [None]:
# Code reading in the training dataset
import pandas as pd
druglib_train = pd.read_csv('drugLibTrain_raw.csv')

#### Druglib.com Data Dictionary

1. urlDrugName (categorical): name of drug  
2. condition (categorical): name of condition  
3. benefitsReview (text): patient review of benefits  
4. sideEffectsReview (text): patient review of side effects  
5. commentsReview (text): overall patient comment  
6. rating (numerical): 10-star patient rating  
7. sideEffects (categorical): 5-step side effect rating  
8. effectiveness (categorical): 5-step effectiveness rating  
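
The train-then-test workflow described for this dataset can be sketched in plain Python. Everything specific here is an assumption made for illustration: the rows are invented stand-ins for drugLibTrain_raw.csv and drugLibTest_raw.csv, the cutoff of 7 for a "positive" rating is arbitrary, and the word-counting classifier is a deliberately naive placeholder for a real model. Only the metric arithmetic is meant to carry over to your own analysis.

```python
# Invented rows standing in for the train and test files (not real reviews).
train_rows = [
    {"benefitsReview": "worked well, pain gone", "rating": 9},
    {"benefitsReview": "no benefit at all", "rating": 2},
    {"benefitsReview": "helped me sleep well", "rating": 8},
    {"benefitsReview": "no change, felt worse", "rating": 3},
]
test_rows = [
    {"benefitsReview": "worked well for me", "rating": 10},
    {"benefitsReview": "no improvement", "rating": 1},
    {"benefitsReview": "felt worse afterwards", "rating": 4},
    {"benefitsReview": "no pain anymore", "rating": 8},
]

POSITIVE_THRESHOLD = 7  # assumption: ratings of 7 or more count as positive


def to_label(rating):
    """Turn a 10-star rating into a binary label (1 = positive)."""
    return 1 if rating >= POSITIVE_THRESHOLD else 0


def tokenize(text):
    """Lowercase the text and split on whitespace, treating commas as spaces."""
    return text.lower().replace(",", " ").split()


def fit_word_scores(rows):
    """Add 1 per word seen in a positive review, subtract 1 per negative one."""
    scores = {}
    for row in rows:
        delta = 1 if to_label(row["rating"]) == 1 else -1
        for word in tokenize(row["benefitsReview"]):
            scores[word] = scores.get(word, 0) + delta
    return scores


def predict(scores, text):
    """Predict positive when the summed word scores are above zero."""
    total = sum(scores.get(word, 0) for word in tokenize(text))
    return 1 if total > 0 else 0


def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Fit on the training rows only, then score the held-out test rows.
scores = fit_word_scores(train_rows)
y_true = [to_label(row["rating"]) for row in test_rows]
y_pred = [predict(scores, row["benefitsReview"]) for row in test_rows]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

The point of the split is that the test rows never influence the fitted scores, so the reported metrics reflect how the model handles reviews it has not seen.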

### Drug Review Dataset (Drugs.com) Data Set

The dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. The intention was to study  
  
(1) sentiment analysis of drug experience over multiple facets, i.e. sentiments learned on specific aspects such as effectiveness and side effects,  
(2) the transferability of models among domains, i.e. conditions, and  
(3) the transferability of models among different data sources (see 'Drug Review Dataset (Druglib.com)').  
  
The dataset includes a separate testing dataset (drugsComTest_raw.csv). Use drugsComTrain_raw.csv to build a prediction model, then measure performance metrics (e.g., accuracy, precision, and recall) on drugsComTest_raw.csv.  

*Source: Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125.*  
  
Visit https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29 for more information.

In [None]:
# Code reading in the training dataset

import pandas as pd
drugs_train = pd.read_csv('drugsComTrain_raw.csv')

drugs_train.head()

#### Drugs.com Data Dictionary

1. drugName (categorical): name of drug  
2. condition (categorical): name of condition  
3. review (text): patient review  
4. rating (numerical): 10-star patient rating
5. date (date): date of review entry
6. usefulCount (numerical): number of users who found review useful
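
The column types listed above can be made concrete with a short sketch that reads rows and converts each field to a matching Python type. It uses the standard-library `csv` module on an in-memory sample so that it runs without the real file. The sample rows are invented, and the date format `%d-%b-%y` is an assumption that you should check against the actual contents of drugsComTrain_raw.csv before relying on it.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Invented sample in the column layout listed above (not real reviews).
SAMPLE_CSV = """drugName,condition,review,rating,date,usefulCount
DrugA,Pain,"Helped within a day",9,20-May-12,27
DrugB,Pain,"Did nothing for me",3,14-Dec-15,4
DrugA,Insomnia,"Slept better, but groggy",6,02-Jan-17,11
"""

DATE_FORMAT = "%d-%b-%y"  # assumption: adjust to match the real file


def load_reviews(handle):
    """Read CSV rows and convert rating, usefulCount, and date to typed values."""
    rows = []
    for row in csv.DictReader(handle):
        row["rating"] = float(row["rating"])
        row["usefulCount"] = int(row["usefulCount"])
        row["date"] = datetime.strptime(row["date"], DATE_FORMAT).date()
        rows.append(row)
    return rows


def mean_rating_by(rows, column):
    """Average the rating within each distinct value of the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["rating"])
    return {key: sum(values) / len(values) for key, values in groups.items()}


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
by_condition = mean_rating_by(reviews, "condition")
print(by_condition)
```

With pandas, the same conversion is usually handled at load time or with `pd.to_datetime`; the manual version is shown only to make each column's type explicit.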

### Health News Dataset

The data was collected in 2015 using the Twitter API. This dataset contains health news from more than 15 major health news agencies such as BBC, CNN, and NYT.  
Each file corresponds to the Twitter account of one news agency. For example, bbchealth.txt contains BBC health news. Each line has the form tweet id|date and time|tweet, with '|' as the separator. This text data has been used to evaluate the performance of topic models on short text data. However, it can be used for other tasks such as clustering.  
  
*Source: Karami, A., Gangopadhyay, A., Zhou, B., & Kharrazi, H. (2017). Fuzzy approach topic discovery in health and medical corpora. International Journal of Fuzzy Systems, 1-12.*  
  
Visit https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter for more information.

In [None]:
# Read 'cnnhealth.txt' into a nested dict of the form {tweet_id: {date: tweet}}
data = {}

with open('cnnhealth.txt', encoding='utf-8') as file: # the file is closed automatically
    for line in file:
        line = line.strip()
        if not line:
            continue # skip blank lines
        line_elements = line.split('|', 2) # separate the three elements (tweet_id, date and tweet) in each line
        if len(line_elements) < 3:
            continue # skip lines that do not match the expected layout
        tweet_id = line_elements[0] # 'tweet_id' is the first element in each line
        date = line_elements[1] # 'date' (date and time) is the second element in each line
        tweet = line_elements[2] # 'tweet' is the third element; it may itself contain '|'
        data.setdefault(tweet_id, {}).setdefault(date, tweet)
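
As an alternative to the nested dictionary, the `tweet id|date and time|tweet` layout can be parsed into a flat list of records, which is often easier to pass to later text mining steps. The sketch below runs on invented sample lines rather than the real file. The timestamp format and the decision to strip URLs from the tweet text are assumptions, and `parse_line` is a hypothetical helper, not part of the module code.

```python
import re
from datetime import datetime

# Invented lines in the documented "tweet id|date and time|tweet" layout.
SAMPLE_LINES = [
    "1001|Thu Apr 09 01:31:50 +0000 2015|New study on sleep | what it means http://example.com/a",
    "1002|Wed Apr 08 23:30:18 +0000 2015|Flu season tips http://example.com/b",
    "malformed line without separators",
]

TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # assumption: check against the real files
URL_PATTERN = re.compile(r"https?://\S+")


def parse_line(line):
    """Return a record dict for one line, or None if the layout does not match."""
    # Split into at most three fields so any extra '|' stays inside the tweet text.
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None
    tweet_id, timestamp, text = parts
    return {
        "tweet_id": tweet_id,
        "timestamp": datetime.strptime(timestamp, TIMESTAMP_FORMAT),
        "text": URL_PATTERN.sub("", text).strip(),  # drop links, keep the words
    }


records = [record for record in map(parse_line, SAMPLE_LINES) if record is not None]
for record in records:
    print(record["tweet_id"], record["timestamp"].date(), record["text"])
```

Limiting the split to two separators matters because a tweet can contain '|' itself; an unrestricted split would silently truncate such tweets.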

# Help Functions

Below is a list of functions you may find helpful in the project. These functions are all based on the modules. You can simply pick up ones in 

Feel free to refer back to the modules, and consider keeping your project code and the module code side by side if that helps. You are also encouraged to develop your own code!

In [5]:
%run TextMiningModule.ipynb

[nltk_data] Downloading package stopwords to c:\users\dapeng\appdata\l
[nltk_data]     ocal\programs\python\python37\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Student Input

**Warning**  
<font color = blue, size = 4> 
    Your work will not be saved in Jupyter Notebook. We recommend copying your work to a safe place so that you do not lose it.
</font>

## How To Download Your Work

The project will be much easier if you can download your work and save your progress. The link below points to instructions for setting up Jupyter Notebook on your own computer, which will allow you to download this notebook and save your work.

<a href="https://datamine.unc.edu/wp-content/uploads/2020/06/FAQs.pdf">Link to Instruction</a>

## Project Input

Use the code cell below to perform your analysis.

Use the cell below the horizontal line for your writeup. Your writeup should describe your analytic process and the findings from your analysis.

---