
## Data Loading

In [None]:
# importing pandas library for data manipulation
import pandas as pd

#URLs for datasets
data_test = 'https://raw.githubusercontent.com/aryan-bu/BA820/main/drug_reviews_dataset/drugsComTest_raw.tsv'
data_train = 'https://raw.githubusercontent.com/aryan-bu/BA820/main/drug_reviews_dataset/drugsComTrain_raw.tsv'

df_test = pd.read_csv(data_test, delimiter='\t')
df_train = pd.read_csv(data_train, delimiter='\t')

In [None]:
#test dataframe
df_test.head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,"February 28, 2012",22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,"May 17, 2009",17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,"March 5, 2017",35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,"October 22, 2015",4


In [None]:
#train dataframe
df_train.head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,"November 27, 2016",37


In [None]:
df_test.shape

(53766, 7)

In [None]:
df_train.shape

(161297, 7)

In [None]:
#concatenating the test and train dataframes vertically
df = pd.concat([df_test, df_train])
df.shape

(215063, 7)
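One detail worth keeping in mind: `pd.concat` keeps each frame's original row labels by default, so labels from `df_test` and `df_train` repeat in the combined frame. The sketch below uses two small made-up frames (not the review data) to show the effect, and how `ignore_index=True` would assign fresh labels if unique ones are needed later.

```python
import pandas as pd

# Toy frames standing in for df_test and df_train (hypothetical data)
toy_test = pd.DataFrame({'drugName': ['A', 'B']})
toy_train = pd.DataFrame({'drugName': ['C', 'D', 'E']})

# Default behaviour: each frame's own row labels are kept, so they repeat
stacked = pd.concat([toy_test, toy_train])
print(stacked.index.tolist())
print(stacked.index.is_unique)

# ignore_index=True assigns a fresh 0..n-1 range instead
relabelled = pd.concat([toy_test, toy_train], ignore_index=True)
print(relabelled.index.tolist())
```

Repeated labels do not affect column-wise steps such as `apply` or `str.replace`, but label-based lookups with `.loc` can return more than one row.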

In [None]:
#renaming the 'Unnamed: 0' column to 'index'
df = df.rename(columns={'Unnamed: 0': 'index'})

#sorting values by index
df = df.sort_values(by='index', ascending=True)
df.head()

Unnamed: 0,index,drugName,condition,review,rating,date,usefulCount
47805,0,Medroxyprogesterone,Abnormal Uterine Bleeding,"""Been on the depo injection since January 2015...",3.0,"October 28, 2015",4
93135,2,Medroxyprogesterone,Amenorrhea,"""I&#039;m 21 years old and recently found out ...",10.0,"October 27, 2015",11
143331,3,Medroxyprogesterone,Abnormal Uterine Bleeding,"""I have been on the shot 11 years and until a ...",8.0,"October 27, 2015",7
57030,4,Medroxyprogesterone,Birth Control,"""Ive had four shots at this point. I was on bi...",9.0,"October 26, 2015",12
106347,5,Medroxyprogesterone,Abnormal Uterine Bleeding,"""I had a total of 3 shots. I got my first one ...",1.0,"October 25, 2015",4


## Preprocessing

### Stop word removal, stemming and lemmatizing

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

#creating a set of English stopwords from NLTK
stop_words = set(stopwords.words('english'))

#defining a function to remove stopwords, then stem and lemmatize the remaining tokens
def process_text(text):
    words = nltk.word_tokenize(text) #tokenizing the input text into words
    filtered_words = [word for word in words if word.lower() not in stop_words] #filtering out stopwords (case-insensitive match)
    stemmed = [stemmer.stem(token) for token in filtered_words] #reducing each token to its Porter stem (this also lowercases it)
    lemmatized = [lemmatizer.lemmatize(token) for token in stemmed] #lemmatizing the stemmed tokens (treated as nouns by default)
    return ' '.join(lemmatized) #joining the processed tokens back into a single string

df['processed_review'] = df['review'].apply(process_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
df['processed_review']

47805     `` depo inject sinc januari 2015 , bleed stop ...
93135     `` & # 039 ; 21 year old recent found might pc...
143331    `` shot 11 year month ago , never 1 period eve...
57030     `` ive four shot point . birth control pill ye...
106347    `` total 3 shot . got first one leav hospit gi...
                                ...                        
59738     `` & # 039 ; ever use , 4 year old sick doctor...
81768     `` acut maxillari sinus . day two , take third...
135055    `` took amox clav 2x day 7 day urinari tract i...
41572     `` day 1 - seriou pain diminish hear right ear...
22470     `` everi time sinu infect prescrib augmentin 8...
Name: processed_review, Length: 215063, dtype: object

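### Removing the Encoded Apostrophe

The raw reviews encode the apostrophe as the HTML entity `&#039;`, which the tokenizer splits into the fragments `& # 039 ;`. These fragments carry no meaning, so we remove them for our analysis.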

In [None]:
#replacing the literal string '& # 039 ;' with an empty string in the 'processed_review' column
df['processed_review'] = df['processed_review'].str.replace('& # 039 ;', '', regex=False)
df.head()

Unnamed: 0,index,drugName,condition,review,rating,date,usefulCount,processed_review
47805,0,Medroxyprogesterone,Abnormal Uterine Bleeding,"""Been on the depo injection since January 2015...",3.0,"October 28, 2015",4,"`` depo inject sinc januari 2015 , bleed stop ..."
93135,2,Medroxyprogesterone,Amenorrhea,"""I&#039;m 21 years old and recently found out ...",10.0,"October 27, 2015",11,`` 21 year old recent found might pco . gott...
143331,3,Medroxyprogesterone,Abnormal Uterine Bleeding,"""I have been on the shot 11 years and until a ...",8.0,"October 27, 2015",7,"`` shot 11 year month ago , never 1 period eve..."
57030,4,Medroxyprogesterone,Birth Control,"""Ive had four shots at this point. I was on bi...",9.0,"October 26, 2015",12,`` ive four shot point . birth control pill ye...
106347,5,Medroxyprogesterone,Abnormal Uterine Bleeding,"""I had a total of 3 shots. I got my first one ...",1.0,"October 25, 2015",4,`` total 3 shot . got first one leav hospit gi...
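An alternative worth considering is to decode the HTML entities in the raw text before tokenizing, so the tokenizer sees a real apostrophe instead of entity fragments. A minimal sketch using the standard library's `html.unescape`; `decode_entities` is a hypothetical helper name, not part of the notebook, and the sample string is a shortened review from the preview above.

```python
import html

def decode_entities(text):
    # html.unescape converts entities such as &#039; back into characters
    return html.unescape(text)

# Shortened sample taken from the 'review' column preview
sample = '"I&#039;ve tried a few antidepressants'
decoded = decode_entities(sample)
print(decoded)
```

Applied as `df['review'].apply(decode_entities)` before `process_text`, this would make the string replacement above unnecessary. Whether it changes downstream results has not been checked here.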


### Lowercasing the text

In [None]:
df['processed_review'] = df['processed_review'].str.lower()

In [None]:
df['processed_review']

47805     `` depo inject sinc januari 2015 , bleed stop ...
93135     ``  21 year old recent found might pco .  gott...
143331    `` shot 11 year month ago , never 1 period eve...
57030     `` ive four shot point . birth control pill ye...
106347    `` total 3 shot . got first one leav hospit gi...
                                ...                        
59738     ``  ever use , 4 year old sick doctor give aug...
81768     `` acut maxillari sinus . day two , take third...
135055    `` took amox clav 2x day 7 day urinari tract i...
41572     `` day 1 - seriou pain diminish hear right ear...
22470     `` everi time sinu infect prescrib augmentin 8...
Name: processed_review, Length: 215063, dtype: object

In [None]:
#getting the count of null values in each column
df.isnull().sum()

index                  0
drugName               0
condition           1194
review                 0
rating                 0
date                   0
usefulCount            0
processed_review       0
dtype: int64

In [None]:
#dropping all the rows with null values
df = df.dropna()
df.isnull().sum()

index               0
drugName            0
condition           0
review              0
rating              0
date                0
usefulCount         0
processed_review    0
dtype: int64

In [None]:
#finding duplicates
duplicates = df[df.duplicated()]
duplicates

Unnamed: 0,index,drugName,condition,review,rating,date,usefulCount,processed_review


There are no duplicate rows either, so we can move ahead with our analysis. Since the dataset is large, we can randomly select 20,000 rows for initial analysis.
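The random selection mentioned above could look like the following sketch. It uses a small made-up frame so that it runs on its own; with the real data the call would be made on `df` with `n=20000`. The `random_state` value is an arbitrary assumption, chosen only to make the draw reproducible.

```python
import pandas as pd

# Small stand-in frame (hypothetical data) so the example is self-contained
toy = pd.DataFrame({
    'review': [f'review {i}' for i in range(100)],
    'rating': [(i % 10) + 1 for i in range(100)],
})

# Draw rows at random without replacement; a fixed random_state makes
# the same rows come back on every run
sample = toy.sample(n=20, random_state=42)
sample_again = toy.sample(n=20, random_state=42)

print(sample.shape)
print(sample.index.equals(sample_again.index))
```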

### Removing Punctuation

In [None]:
import string

#function to remove punctuation from the text
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation) #creating a translation table that deletes every ASCII punctuation character
    return text.translate(translator)

df['processed_review'] = df['processed_review'].apply(remove_punctuation)
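Deleting standalone punctuation tokens, such as the quote marks and commas left by the tokenizer, keeps the spaces that surrounded them, so the text can contain runs of spaces afterwards. If that matters downstream, one option is to collapse whitespace in the same step. A minimal sketch; `remove_punctuation_and_squeeze` is a hypothetical name, and the sample string imitates the processed reviews shown earlier.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_and_squeeze(text):
    stripped = text.translate(PUNCT_TABLE)
    # split() with no arguments splits on any run of whitespace, so
    # re-joining with one space also trims the ends
    return ' '.join(stripped.split())

sample = '`` depo inject sinc januari 2015 , bleed stop'
cleaned = remove_punctuation_and_squeeze(sample)
print(cleaned)
```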

## Sentiment Categorization

In [None]:
df.rating.unique()

array([ 3., 10.,  8.,  9.,  1.,  5.,  2.,  7.,  4.,  6.])

In [None]:
#assigning sentiment labels based on the rating values: 1-3 negative, 4-6 neutral, 7-10 positive
df['sentiment'] = df['rating'].apply(lambda x: 'negative' if x <= 3 else ('neutral' if x <= 6 else 'positive'))
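The same thresholds can also be written as explicit bins, which makes the boundaries easy to audit. The sketch below applies `pd.cut` to a made-up series of ratings and cross-checks it against the lambda above; it is an equivalent formulation for ratings between 1 and 10, not the notebook's own code.

```python
import pandas as pd

# Made-up ratings covering every bin boundary
ratings = pd.Series([1.0, 3.0, 4.0, 6.0, 7.0, 10.0])

# Right-inclusive bins: (0, 3] negative, (3, 6] neutral, (6, 10] positive
sentiment = pd.cut(
    ratings,
    bins=[0, 3, 6, 10],
    labels=['negative', 'neutral', 'positive'],
)

# Cross-check against the lambda-based labelling
expected = ratings.apply(
    lambda x: 'negative' if x <= 3 else ('neutral' if x <= 6 else 'positive')
)
print((sentiment.astype(str) == expected).all())
```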

In [None]:
df.to_csv('processed_reviews.csv', index=False)