# Drug Sentiment Analysis

## Problem Statement
The dataset provides patient reviews of specific drugs along with the related conditions and a 10-star patient rating reflecting overall patient satisfaction. We have to create a target feature from the ratings and predict the sentiment of the reviews.

### Data Description

The data is split into a train (75%) and a test (25%) partition.

* drugName (categorical): name of drug
* condition (categorical): name of condition
* review (text): patient review
* rating (numerical): 10-star patient rating
* date (date): date of review entry
* usefulCount (numerical): number of users who found the review useful

Each record follows the same pattern: a patient with a unique ID purchases a drug for their condition, then writes a review and gives a rating for that drug on the recorded date. Afterwards, other users who read the review and find it helpful can click usefulCount, which adds 1 to that variable.

### Import all the necessary packages
Here we have imported the basic packages that are required to do basic processing. Feel free to use any library that you think can be useful here.

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
from matplotlib import style
style.use('ggplot')

### Load Data

In [0]:
#load the train and test data
test = pd.read_csv('mention-data-path')
train = pd.read_csv('mention-data-path')

### Checking Out The Data

In [0]:
#write code to check the head of train data

In [0]:
#write code to check the head of test data

In [0]:
#check the shape of the given dataset

#write code to check the shape of train data
#write code to check the shape of test data

In [0]:
#write code to check the columns in train data

## Exploratory Data Analysis

The purpose of EDA is to find interesting insights and irregularities in our dataset. We will look at each feature, try to find interesting facts and patterns in it, and see whether there is any relationship between the variables.

Merge the train and test data, as there are no target labels yet. We will perform our EDA and pre-processing on the merged data, then split it into training and testing sets.

In [0]:
#merge train and test data

merge = [train,test]
merged_data = #write code to merge the data

# write code to check the shape of merged_data
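
One possible way to do the merge is sketched below. It is only an illustration on toy data: the two small frames and their values are made up, and the column name `uniqueID` is an assumption about how the ID column is named in the real files.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are invented for illustration).
train = pd.DataFrame({"uniqueID": [1, 2, 3], "drugName": ["A", "B", "A"], "rating": [9, 2, 7]})
test = pd.DataFrame({"uniqueID": [4, 5], "drugName": ["C", "B"], "rating": [10, 1]})

# Stack the rows of both frames; ignore_index=True rebuilds a clean 0..n-1 index.
merged_data = pd.concat([train, test], ignore_index=True)

print(merged_data.shape)                  # rows of train + rows of test, same columns
print(merged_data["uniqueID"].nunique())  # equals the row count when no ID is duplicated
```

If the number of unique IDs matches the number of rows, no record appears twice in the merged data.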

### Check number of uniqueIds to see if there's any duplicate record in our dataset

In [0]:
#write code to check the number of unique ids

### Check information of the merged data

In [0]:
#write code to check the information about merged data

### Check the Description

In [0]:
#write code to check the description of merged data
#Note: the description should also include all the categorical variables

### Check the Number of null values in each column

In [0]:
#Write code to check null values in merged data
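
As a rough sketch of the description and null-value checks, the snippet below runs both on a small made-up frame. The data is hypothetical; only the column names follow the data description.

```python
import pandas as pd

# Toy frame with one missing condition (values are invented for illustration).
merged_data = pd.DataFrame({
    "drugName": ["A", "B", "A", "C"],
    "condition": ["Pain", None, "Acne", "Pain"],
    "rating": [9, 2, 7, 10],
})

# include="all" adds count/unique/top/freq rows for the non-numeric columns.
description = merged_data.describe(include="all")
print(description)

# One null count per column.
null_counts = merged_data.isnull().sum()
print(null_counts)
```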

### Check number of unique values in drugName and condition

In [0]:
#check number of unique values in drugName


#check number of unique values in condition


### Check the top 20 conditions

In [0]:
#plot a bargraph to check top 20 conditions
plt.figure(figsize=(12,6))

#write your code here
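
A minimal sketch of how the most and least frequent values of a column could be extracted, assuming `value_counts` is an acceptable approach here. The series is toy data, and `head(2)`/`tail(2)` stand in for `head(20)`/`tail(20)` on the real column.

```python
import pandas as pd

# Toy condition column (values are invented for illustration).
conditions = pd.Series(
    ["Pain", "Acne", "Pain", "Depression", "Pain", "Acne"], name="condition"
)

# value_counts sorts by frequency in descending order.
counts = conditions.value_counts()

top = counts.head(2)     # most frequent values; use head(20) on the real data
bottom = counts.tail(2)  # least frequent values; use tail(20) on the real data

print(top)
print(bottom)
```

Calling `top.plot(kind='bar')` on the result would draw the bar graph on the current figure. The same pattern applies to the `drugName` column.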




### Plot the bottom 20 conditions

In [0]:
#plot a bargraph to check bottom 20 conditions
plt.figure(figsize=(12,6))
# write your code to check the bottom-20 conditions




### Check top 20 drugName

In [0]:
#plot a bargraph to check top 20 drugName
plt.figure(figsize=(12,6))

#write your code here

### Check bottom 20 drugName

In [0]:
#plot a bargraph to check bottom 20 drugName
plt.figure(figsize=(12,6))

#write your code here

### Checking Ratings Distribution

In [0]:

#Write your code to get the value counts in descending order and reset the index as rating and counts


In [0]:
# plot a bar chart to check the distribution of ratings

### Check the distribution of usefulCount

In [0]:

#Write code to plot a distplot of usefulCount


In [0]:
# Write code to plot a boxplot of usefulCount to see five number summary

### Check number of Drugs per condition

In [0]:
#lets check the number of drugs/condition

#write code to check the number of drugs present per condition

##### Let's look at `3</span> users found this comment helpful` in conditions

In [0]:
span_data = #write code to get all the records whose condition values contain the pattern '</span>'

#print span_data

noisy_data_ = #Write code to compute the percentage of span_data out of total records

#print percentage of noisy data

In [0]:

#Write code to drop the noisy data 
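
The noise check and removal could look roughly like the sketch below. The frame is toy data, and the names `is_noisy`, `noisy_pct` and `cleaned` are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

# Toy frame: one condition value is leftover HTML, one is missing (invented values).
merged_data = pd.DataFrame({
    "condition": ["Pain", "3</span> users found this comment helpful.", "Acne", None],
    "rating": [8, 10, 3, 5],
})

# regex=False matches the literal text; na=False treats missing values as "no match".
is_noisy = merged_data["condition"].str.contains("</span>", na=False, regex=False)

span_data = merged_data[is_noisy]
noisy_pct = 100 * len(span_data) / len(merged_data)
print(span_data)
print(noisy_pct)

# Keep only the rows that are not flagged as noise.
cleaned = merged_data[~is_noisy]
```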


### Now let's look at the not listed/other

In [0]:
#Write code to check the percentage of 'not listed / othe' conditions in our dataset

In [0]:
#Write code to drop the records where condition == 'not listed / othe'

### Now Check number of drugs present per condition after removing noise

In [0]:
#lets check top-20 condition with higher number of drugs

#write your code here to plot a bargraph to see the number of drugs per condition (top-20)

### Check bottom 20 drugs per condition

In [0]:
#Write code to check the number of drugs per condition (bottom-20)

### Now let's check if a single drug can be used for Multiple conditions

In [0]:
#let's check if a single drug is used for multiple conditions
drug_multiple_cond = #Write code to get the drugName and for number of conditions it is used for

print(drug_multiple_cond)
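
One way to count the conditions each drug is used for is a group-by with `nunique`, sketched here on toy data (the drug and condition values are made up).

```python
import pandas as pd

# Toy frame (values are invented for illustration).
merged_data = pd.DataFrame({
    "drugName": ["A", "A", "B", "C", "C", "C"],
    "condition": ["Pain", "Fever", "Acne", "Pain", "Pain", "Anxiety"],
})

# Number of distinct conditions per drug, largest first.
drug_multiple_cond = (
    merged_data.groupby("drugName")["condition"]
    .nunique()
    .sort_values(ascending=False)
)
print(drug_multiple_cond)

# Drugs that appear with more than one condition.
print(drug_multiple_cond[drug_multiple_cond > 1])
```

Swapping the two column names gives the number of drugs per condition instead.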

### Check the number of drugs with rating 10

In [0]:
#Write code to check the Number of drugs with rating 10.


### Check top 20 drugs with rating 10

In [0]:
#Check top 20 drugs with rating=10/10

#Write code to check top-20 drugName with rating-10

### Top 20 drugs with 1/10 Rating

In [0]:
#check top 20 drugs with 1/10 rating

#Write your code to check the top-20 drugs with rating 1/10
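
A possible pattern for both rating checks is to filter on the rating first and then count drug names. The sketch below uses toy data, and `head(2)` stands in for `head(20)`.

```python
import pandas as pd

# Toy frame (values are invented for illustration).
merged_data = pd.DataFrame({
    "drugName": ["A", "A", "B", "C", "A", "B"],
    "rating": [10, 10, 10, 1, 1, 1],
})

# Drugs that most often received a 10/10 rating.
top_rated = merged_data.loc[merged_data["rating"] == 10, "drugName"].value_counts().head(2)

# Drugs that most often received a 1/10 rating.
low_rated = merged_data.loc[merged_data["rating"] == 1, "drugName"].value_counts().head(20)

print(top_rated)
print(low_rated)
```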

### Now we will look at the Date column

In [0]:
# convert date to datetime and create year and month features

merged_data['date'] = pd.to_datetime(merged_data['date'])
merged_data['year'] = merged_data['date'].dt.year  #create year
merged_data['month'] = merged_data['date'].dt.month #create month

### Check Number of reviews per year

In [0]:
#plot number of reviews year wise
count_reviews = merged_data['year'].value_counts().sort_index()

#plot a bargraph to check number of reviews per year

### Check average rating per year

In [0]:
#check average rating per year
yearly_mean_rating = merged_data.groupby('year')['rating'].mean()

#Write code to plot a bargraph showing average rating per year

### Per year drug count and Condition count

In [0]:

year_wise_condition = merged_data.groupby('year')['condition'].nunique()

#plot a bargraph to check the conditions per year

In [0]:
#check drugs year wise

year_wise_drug = merged_data.groupby('year')['drugName'].nunique()

#plot a bargraph to check the drugName per year

## Data Pre-Processing

Data pre-processing is a vital part of model building. We have all heard the statement **"Garbage In, Garbage Out"**, but what does it mean? It means that if we feed garbage into the model, such as missing values or features that have no predictive power or that duplicate information already present, the model will do little better than random guessing and will not be reliable enough to use for predictions.

We will remove those unwanted features and noise from our data. We also know that we can only feed numerical values into our model, but here we have both numerical and categorical features, so we will transform the categorical features into numeric values.

In [0]:
# Write code to check the null values


In [0]:
#Write code to drop the null values


### Pre-Processing Reviews

In [0]:
#check first three reviews
for i in merged_data['review'][0:3]:
    print(i,'\n')

"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 

"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. 
We have tried many different medications and so far this is the most effective." 

"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gesto

### Steps for reviews pre-processing.
* **Remove HTML tags**
     * Use BeautifulSoup from the bs4 module to remove the HTML tags. We have already removed the records with the pattern `64</span>...`; we will use get_text() to remove any HTML tags that remain.
* **Remove Stop Words**
     * Remove the stopwords like "a", "the", "I" etc.
* **Remove symbols and special characters**
     * We will remove the special characters from our reviews like '#' ,'&' ,'@' etc.
* **Tokenize**
     * We will tokenize the words. We will split the sentences with spaces e.g "I might come" --> "I", "might", "come"
* **Stemming**
     * Remove the suffixes from the words to get the root form of the word e.g 'Wording' --> "Word"

In [0]:
#import the libraries for pre-processing
from bs4 import BeautifulSoup
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stops = set(stopwords.words('english')) #english stopwords

stemmer = SnowballStemmer('english') #SnowballStemmer

def review_to_words(raw_review):

    # 1. Remove HTML tags
    review_text = #Write code to delete all the html tags

    # 2. Replace non-letter characters with spaces
    letters_only = #Write code to set spaces between words

    # 3. Lowercase and split into words
    words = #write code to lower all the reviews and split it

    # 4. Remove stopwords
    meaningful_words = #Write code to remove stopwords

    # 5. Stemming
    stemming_words = #write code to apply stemming on meaningful_words

    # 6. Join the words back with spaces
    return ' '.join(stemming_words)
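
The flow of the function above can be sketched with the standard library alone. This is not the notebook's intended solution, only an illustration of the step order under several assumptions: a regular expression stands in for BeautifulSoup's `get_text()`, `DEMO_STOPS` is a tiny hypothetical stopword set instead of the NLTK list, and the stemmer is a pluggable `stem` argument that defaults to doing nothing.

```python
import html
import re

# Tiny illustrative stopword set; the notebook itself uses NLTK's English list.
DEMO_STOPS = {"i", "it", "a", "the", "is", "of", "and", "in", "no", "has", "with"}

def review_to_words_sketch(raw_review, stops=DEMO_STOPS, stem=lambda w: w):
    # 1. Strip tags and decode entities (regex stand-in for BeautifulSoup(...).get_text()).
    text = html.unescape(re.sub(r"<[^>]+>", " ", raw_review))
    # 2. Replace everything that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lowercase and tokenize on whitespace.
    words = letters_only.lower().split()
    # 4. Drop stopwords.
    meaningful = [w for w in words if w not in stops]
    # 5. Stem each remaining word (stemmer.stem from NLTK could be passed as `stem`).
    stemmed = [stem(w) for w in meaningful]
    # 6. Join back into one space-separated string.
    return " ".join(stemmed)

cleaned = review_to_words_sketch("It has <b>no</b> side effect &amp; I take it with Fish Oil!")
print(cleaned)
```

Keeping the stopword set and the stemmer as arguments makes it easy to swap in the NLTK objects created earlier in the cell.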

In [0]:
#apply review_to_words function on reviews

### Now we will create our target variable "Sentiment" from rating

In [0]:
#create sentiment feature from ratings
#if rating > 5 sentiment = 1 (positive)
#if rating < 5 sentiment = 0 (negative)

#Write your code here
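
A minimal sketch of the rating-to-sentiment rule follows. The comments above do not say what happens when the rating is exactly 5, so treating it as negative here is an assumption, and `rating_to_sentiment` is a hypothetical helper name.

```python
def rating_to_sentiment(rating):
    # Positive (1) for ratings above 5; everything else is negative (0).
    # Treating a rating of exactly 5 as negative is an assumption: the stated
    # rule leaves that case open.
    return 1 if rating > 5 else 0

ratings = [10, 1, 5, 6, 8]
sentiments = [rating_to_sentiment(r) for r in ratings]
print(sentiments)  # [1, 0, 0, 1, 1]
```

On the DataFrame, the same function could be applied with `merged_data['rating'].apply(rating_to_sentiment)`.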

## Building Model

In [0]:
#import all the necessary packages
from sklearn.model_selection import train_test_split #import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer #import TfidfVectorizer 
from sklearn.metrics import confusion_matrix #import confusion_matrix
from sklearn.naive_bayes import MultinomialNB #import MultinomialNB
from sklearn.ensemble import RandomForestClassifier  #import RandomForestClassifier

### TfidfVectorizer (Term frequency - Inverse document frequency)
We cannot pass raw text features to our model; we have to convert them into numeric values. We will use TfidfVectorizer to convert our reviews into vectors.

**TF - Term Frequency**

How often a term t occurs in a document d.

TF = (_Number of occurrences of a word in the document_) / (_Number of words in that document_)

**IDF - Inverse Document Frequency**

IDF = log(_Number of documents_ / _Number of documents containing the word_)

Here each review is one document.

**TF-IDF = TF * IDF**
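
To make the formulas concrete, here is a small worked example in plain Python on three made-up, already tokenized documents. The helper names `tf`, `idf` and `tf_idf` are hypothetical. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

# Three toy documents, already tokenized (invented for illustration).
docs = [
    ["drug", "works", "well"],
    ["drug", "caused", "headache"],
    ["no", "side", "effects", "works"],
]

def tf(term, doc):
    # Share of the document's words that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document (otherwise division by zero).
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("drug", docs[1], docs))      # appears in 2 of 3 docs: 1/3 * ln(3/2)
print(tf_idf("headache", docs[1], docs))  # appears in 1 of 3 docs: 1/3 * ln(3)
```

The rarer word gets the higher weight within the same document, which is the point of the IDF factor.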


In [0]:
# Creates TF-IDF vectorizer and transforms the corpus
vectorizer = TfidfVectorizer()
reviews_corpus = #fit the vectorizer on reviews
reviews_corpus.shape

### **Store Dependent feature in sentiment and split the Data into train and test**

In [0]:
#dependent feature
sentiment = #Write code to store target feature i.e sentiment in sentiment variable

#write code to check the shape

In [0]:
#split the data in train and test

X_train,X_test,Y_train,Y_test = #Write code to split data into training and testing (test_size = 0.33)

#check shape of training set
#check shape of testing set

### Apply Multinomial Naive Bayes

In [0]:
#fit the model and predict the output

clf = MultinomialNB() #fit the training data

pred =  #predict the sentiment for test data

#Write code to check accuracy

#print confusion matrix
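
The vectorize, split, fit and evaluate steps can be sketched end to end on a toy corpus. Everything below is illustrative: the twelve reviews and their labels are made up, `accuracy_score` is an extra import not listed in the notebook's import cell, and `random_state` and `stratify` are added only to make the toy split reproducible and balanced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy cleaned reviews with sentiment labels (invented for illustration).
reviews = [
    "work great no side effect", "love this drug pain gone",
    "best medicine ever", "feel much better now",
    "great result happy", "help sleep well",
    "terrible headache stop take", "made me sick awful",
    "no effect waste money", "worst drug ever",
    "pain worse bad rash", "horrible nausea never again",
]
sentiment = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Turn the text into a sparse TF-IDF matrix (one row per review).
vectorizer = TfidfVectorizer()
reviews_corpus = vectorizer.fit_transform(reviews)

X_train, X_test, Y_train, Y_test = train_test_split(
    reviews_corpus, sentiment, test_size=0.33, random_state=42, stratify=sentiment
)

clf = MultinomialNB()
clf.fit(X_train, Y_train)
pred = clf.predict(X_test)

print(accuracy_score(Y_test, pred))
print(confusion_matrix(Y_test, pred, labels=[0, 1]))
```

Fitting the vectorizer on all reviews before splitting mirrors the order of the cells above. Fitting it on the training part only is a common alternative that keeps test vocabulary out of training.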

### Apply RandomForest

In [0]:
#fit the model and predict the output

clf = RandomForestClassifier() #Write code to fit training data

pred = # Predict the target labels

#Write code to check accuracy

#print confusion matrix

## Parameter Tuning

In [0]:
#try different sets of parameters like n_estimators, max_depth, min_samples_leaf etc. and choose the best set of parameters.
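
One common way to try several parameter sets is a cross-validated grid search. The sketch below assumes `GridSearchCV` is an acceptable tool for this step. It uses synthetic data from `make_classification` rather than the review vectors, and the grid values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features and sentiment labels.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Candidate values to try for each parameter (arbitrary example values).
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 3],
}

# Evaluate every combination with 3-fold cross-validation on accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit` with the best parameter combination.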

## Conclusion
Write down your interpretations about your model and insights here.