Importing the libraries

In [1]:
import pandas as pd
import html

Import the train and test datasets and concatenate them

In [2]:
train = pd.read_csv('drugsCom_raw/drugsComTrain_raw.tsv', sep='\t')
print(f'Train shape: {train.shape}')
test = pd.read_csv('drugsCom_raw/drugsComTest_raw.tsv', sep='\t')
print(f'Test shape: {test.shape}')

dataset_all_raw = pd.concat([train, test], axis=0)
print(f'Dataset shape: {dataset_all_raw.shape}')
missing_data = dataset_all_raw.isna().sum()
print(f'Missing data in concatenated dataset:\n{missing_data}')

Train shape: (161297, 7)
Test shape: (53766, 7)
Dataset shape: (215063, 7)
Missing data in concatenated dataset:
Unnamed: 0        0
drugName          0
condition      1194
review            0
rating            0
date              0
usefulCount       0
dtype: int64
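
One detail of the concatenation above: by default pd.concat keeps each frame's original row labels, so the combined frame can contain repeated index values. A minimal sketch on toy data (the frames and values are hypothetical, not taken from the dataset) shows the effect and the ignore_index option that renumbers the rows:

```python
import pandas as pd

# Toy frames standing in for the train and test splits (hypothetical values).
train_toy = pd.DataFrame({"drugName": ["A", "B"], "rating": [9, 3]})
test_toy = pd.DataFrame({"drugName": ["C"], "rating": [6]})

# Default concat keeps each frame's own index, so label 0 appears twice.
stacked = pd.concat([train_toy, test_toy], axis=0)
print(stacked.index.tolist())      # [0, 1, 0]

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([train_toy, test_toy], axis=0, ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```

Repeated labels do not affect the boolean filtering used later in this notebook, but they matter for any label-based lookup with .loc.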


Drop the rows with missing data
Drop the date, usefulCount and Unnamed: 0 columns

In [3]:
dataset_all_raw.keys()

Index(['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object')

In [20]:
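# errors='ignore' makes this cell safe to re-run: the original inplace version raised
# KeyError "['date', 'usefulCount', 'Unnamed: 0'] not found in axis" once the columns were gone.
dataset_all_raw = dataset_all_raw.dropna()
dataset_all_raw = dataset_all_raw.drop(columns=['date', 'usefulCount', 'Unnamed: 0'], errors='ignore')
print(f'Dataset shape after drop operations: {dataset_all_raw.shape}')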

In some samples the condition field contains leftover <span> markup instead of a condition name; we drop those rows

In [21]:
dataset_all_raw_span_dropped = dataset_all_raw[~dataset_all_raw['condition'].str.contains('span')]
print(f'Dataset shape after drop span values: {dataset_all_raw_span_dropped.shape}')

Dataset shape after drop span values: (212698, 4)
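
The filter above can be sketched on a small frame. The rows below are hypothetical stand-ins (including the shape of the malformed condition value), and the explicit .copy() is an optional addition, not part of the original cell, that gives an independent frame so later column assignments do not trigger SettingWithCopyWarning:

```python
import pandas as pd

# Hypothetical rows: the second condition is HTML residue, not a condition name.
toy = pd.DataFrame({
    "condition": ["Depression", "3</span> users found this comment helpful.", "Acne"],
    "rating": [8, 2, 10],
})

# regex=False treats 'span' as a plain substring rather than a pattern.
mask = toy["condition"].str.contains("span", regex=False)

# ~mask keeps the rows that do NOT contain the substring.
clean = toy[~mask].copy()
print(clean["condition"].tolist())  # ['Depression', 'Acne']
```

Matching on the bare substring 'span' would also remove any legitimate condition name that happens to contain those letters; matching on '</span>' is a stricter option if that is a concern.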


Decode the HTML entities and create a new column with the decoded review

In [24]:
def html_decoder(row):
    return html.unescape(row["review"])
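
As a quick illustration of what html.unescape does to a single string, here is a minimal sketch. The sample text is made up for the example:

```python
import html

# A made-up review fragment containing numeric and named HTML entities.
raw = "I&#039;ve tried it &amp; it works &quot;well&quot;"

decoded = html.unescape(raw)
print(decoded)  # I've tried it & it works "well"
```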

In [25]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised by assigning a column on the filtered slice.
dataset_all_raw_span_dropped = dataset_all_raw_span_dropped.assign(plain_text=dataset_all_raw_span_dropped.apply(html_decoder, axis=1))


In [26]:
# Strip the literal double quotes wrapping each review; assign() returns a new DataFrame and avoids SettingWithCopyWarning.
dataset_all_raw_span_dropped = dataset_all_raw_span_dropped.assign(text=dataset_all_raw_span_dropped["plain_text"].str.strip('"'))


In [27]:
# Reassign instead of mutating in place, which avoids SettingWithCopyWarning on the filtered slice.
dataset_all_raw_span_dropped = dataset_all_raw_span_dropped.drop(columns=['review', 'plain_text'])
dataset_all_raw_span_dropped = dataset_all_raw_span_dropped.rename(columns={'text': 'review'})
print(f'Dataset shape after html decoding: {dataset_all_raw_span_dropped.shape}')
print(f'Columns in dataset: {dataset_all_raw_span_dropped.keys()}')

Dataset shape after html decoding: (212698, 4)
Columns in dataset: Index(['drugName', 'condition', 'rating', 'review'], dtype='object')
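
The decode, strip and rename steps above could also be collapsed into a single expression that overwrites the review column directly. This is only a sketch of an alternative on hypothetical sample rows, not the notebook's own method:

```python
import html
import pandas as pd

# Hypothetical reviews wrapped in literal double quotes, as in the raw data.
toy = pd.DataFrame({"review": ['"It&#039;s great"', '"No side effects &amp; cheap"']})

# Decode HTML entities element-wise, then strip the surrounding quotes.
toy["review"] = toy["review"].map(html.unescape).str.strip('"')
print(toy["review"].tolist())  # ["It's great", 'No side effects & cheap']
```

Operating on the column with map and the .str accessor avoids the row-wise apply and the temporary plain_text and text columns.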


Add a ratingLabel column to the dataset: -1 for ratings of 4 or lower, 1 for ratings of 7 or higher, and 0 otherwise

In [28]:
def create_label(row):
    # Map the numeric rating to a three-class label.
    if row["rating"] <= 4:
        return -1
    if row["rating"] >= 7:
        return 1
    return 0

In [29]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised by assigning a column on the filtered slice.
dataset_all_raw_span_dropped = dataset_all_raw_span_dropped.assign(ratingLabel=dataset_all_raw_span_dropped.apply(create_label, axis=1))
print(f'Dataset shape after adding ratingLabel column: {dataset_all_raw_span_dropped.shape}')
print(f'Columns in dataset: {dataset_all_raw_span_dropped.keys()}')

Dataset shape after adding ratingLabel column: (212698, 5)
Columns in dataset: Index(['drugName', 'condition', 'rating', 'review', 'ratingLabel'], dtype='object')
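
The same thresholds as create_label can be applied to the whole column at once, which is usually faster than a row-wise apply on a frame of this size. A minimal sketch using numpy.select, with made-up ratings and assuming the rating column is numeric:

```python
import numpy as np
import pandas as pd

# Made-up ratings covering both thresholds and the middle band.
ratings = pd.Series([1, 4, 5, 6, 7, 10], name="rating")

# Conditions are checked in order; rows matching none get the default.
labels = np.select([ratings <= 4, ratings >= 7], [-1, 1], default=0)
print(labels.tolist())  # [-1, -1, 0, 0, 1, 1]
```

In the notebook this would amount to assigning the np.select result, computed on the rating column, to ratingLabel.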


Now we can save the dataset to a CSV file

In [30]:
dataset_all_raw_span_dropped.to_csv('cleaned_and_labeled_dataset.csv', index=False)
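
A short sketch of why index=False matters when saving. Without it, to_csv writes the row index as an extra unnamed column, which pandas labels 'Unnamed: 0' when the file is read back. The frame below is hypothetical, and an in-memory buffer stands in for the file on disk:

```python
import io
import pandas as pd

# Hypothetical one-row frame with the same columns as the cleaned dataset.
toy = pd.DataFrame({
    "drugName": ["A"],
    "condition": ["Acne"],
    "rating": [9],
    "review": ["Works, mostly"],
    "ratingLabel": [1],
})

def round_trip(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(toy)                   # index written as first column
without_index = round_trip(toy, index=False)   # only the data columns

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

The first list starts with 'Unnamed: 0'; the second contains only the five data columns.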