![title](inn.png)

# Innoplexus Online Hiring Hackathon: Sentiment Analysis

## Problem Statement

### Sentiment Analysis for drugs/medicines
Nowadays, the narrative of a brand is no longer built and controlled solely by the company that owns it. For this reason, companies constantly monitor blogs, forums, and other social media platforms to gauge the sentiment toward their own products and those of competitors, and to learn how their brand resonates in the market. This kind of analysis supports their post-launch market research and is relevant to many industries, including pharma and its drugs.
 

**The challenge is that the language used in this type of content is not strictly grammatically correct. Some users employ sarcasm. Others cover several topics with different sentiments in one post. Still others post comments and replies that indicate their sentiment about the topic.**

Sentiment can be grouped into 3 major buckets: **Positive, Negative, and Neutral.**


You are provided with data containing samples of text. This text can contain one or more drug mentions. Each row contains a unique combination of the text and the drug mention. Note that the same text can also have different sentiment for a different drug.

Given the text and drug name, the task is to predict the sentiment for texts contained in the test dataset. Given below is an example of text from the dataset:


Example:

*Stelara is still fairly new to Crohn's treatment. This is why you might not get a lot of replies. I've done some research, but most of the "time to work" answers are from Psoriasis boards. For Psoriasis, it seems to be about 4-12 weeks to reach a strong therapeutic level. The good news is, Stelara seems to be getting rave reviews from Crohn's patients. It seems to be the best med to come along since Remicade. I hope you have good success with it. My daughter was diagnosed Feb. 19/07, (13 yrs. old at the time of diagnosis), with Crohn's of the Terminal Illium. Has used Prednisone and Pentasa. Started Imuran (02/09), had an abdominal abscess (12/08). 2cm of Stricture. Started ​Remicade in Feb. 2014, along with 100mgs. of Imuran.*


For Stelara the above text is **positive** while for Remicade the above text is **negative**.
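As a minimal sketch of what "one row per text and drug" means, the example above could appear in the training data as two rows that share the same text but carry different labels. The hash values and the shortened text below are placeholders for illustration, not values from the dataset; the label encoding follows the Data Description section.

```python
import pandas as pd

# Placeholder text standing in for the full forum post quoted above.
text = "Stelara seems to be getting rave reviews. Started Remicade in Feb. 2014."

# Hypothetical rows in the train.csv layout; hash_a / hash_b are made-up IDs.
rows = pd.DataFrame({
    "unique_hash": ["hash_a", "hash_b"],
    "text": [text, text],
    "drug": ["stelara", "remicade"],
    "sentiment": [0, 1],  # 0 = positive, 1 = negative
})

# The same text maps to two different labels, one per drug.
labels_per_text = rows.groupby("text")["sentiment"].nunique()
print(labels_per_text.iloc[0])
```

This is why the drug name has to be treated as part of the input: the text alone does not determine the label.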

### Data Description
**train.csv**
Contains the labelled texts with sentiment values for a given drug
 
|Variable|Definition|
|----|----|
|unique_hash|Unique ID|
|text|Text pertaining to the drugs|
|drug|Drug name for which the sentiment is provided|
|sentiment|(Target) 0 = positive, 1 = negative, 2 = neutral|


**test.csv**
Contains texts with drug names for which participants are expected to predict the correct sentiment.
 

### Evaluation Metric
The metric used to evaluate the classification model is the macro F1-score.
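Macro F1 is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of how many samples it has. The toy labels below are invented for illustration; they sketch how a model that always predicts the majority class can look acceptable on accuracy and still score poorly on macro F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented toy labels: mostly neutral (2), one negative (1), one positive (0).
y_true = [2, 2, 2, 2, 1, 0]
# A degenerate model that always predicts the majority class.
y_pred = [2, 2, 2, 2, 2, 2]

labels = [0, 1, 2]
# Per-class F1; zero_division=0 scores classes that are never predicted as 0.
per_class = f1_score(y_true, y_pred, average=None, labels=labels, zero_division=0)
# Macro F1 is the plain mean of the per-class scores.
macro = f1_score(y_true, y_pred, average="macro", labels=labels, zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

print("per-class F1:", per_class)
print("macro F1:", macro)
print("accuracy:", accuracy)
```

Here accuracy is 4/6 while the two minority classes each contribute an F1 of 0, which pulls the macro average down to roughly 0.27.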
 

## Public and Private Split

The texts in the test data are randomly divided into Public (40%) and Private (60%) subsets.
Your submissions are checked and scored on the Public data during the competition.
The final rankings are based on your Private score, which is published once the competition is over.

# Approaches



# Leaderboard

In [1]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [0]:
root_path = 'gdrive/My Drive/AV/AV INnoplexus/'

In [0]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import textblob

import os
# print(os.listdir("../input"))
os.environ['PYTHONHASHSEED'] = '10000'
np.random.seed(10001)
import random
import tensorflow as tf
random.seed(10002)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=6, inter_op_parallelism_threads=5)
from keras import backend

tf.set_random_seed(10003)
backend.set_session(tf.Session(graph=tf.get_default_graph(), config=session_conf))

In [0]:
train=pd.read_csv(root_path+'train.csv')
test=pd.read_csv(root_path+'test.csv')
s=pd.read_csv(root_path+'sample_submission.csv')

In [345]:
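enc = OneHotEncoder(sparse=False)
enc.fit(train["sentiment"].values.reshape(-1, 1))
# categories_ replaces the deprecated n_values_ attribute (removed in later scikit-learn releases)
print("Number of classes:", len(enc.categories_[0]))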

print("Class distribution:\n{}".format(train["sentiment"].value_counts()/train.shape[0]))

Number of classes: 3
Class distribution:
2    0.724569
1    0.158553
0    0.116878
Name: sentiment, dtype: float64




In [346]:
all_drugs=train.drug.unique()
all_drugs

array(['gilenya', 'fingolimod', 'ocrevus', 'cladribine', 'humira',
       'tagrisso', 'lucentis', 'pan-retinal photocoagulation', 'remicade',
       'stelara', 'ocrelizumab', 'dexamethasone', 'pemetrexed', 'cimzia',
       'tarceva', 'nivolumab', 'tecentriq', 'ipilimumab', 'mekinist',
       'opdivo', 'dexamethasone implant', 'eylea', 'erlotinib',
       'alectinib', 'entyvio', 'crizotinib', 'keytruda', 'mavenclad',
       'osimertinib', 'vedolizumab', 'atezolizumab', 'durvalumab',
       'alimta', 'tysabri', 'avastin', 'golimumab', 'tofacitinib',
       'ixifi', 'teriflunomide', 'ranibizumab', 'afatinib',
       'upadacitinib', 'zykadia', 'ustekinumab', 'xalkori',
       'pembrolizumab', 'lemtrada', 'siponimod', 'simponi', 'inflectra',
       'entrectinib', 'yervoy', 'vitrectomy', 'bevacizumab', 'gefitinib',
       'amjevita', 'lorlatinib', 'pemrolizumab', 'tafinlar',
       'infliximab-dyyb', 'ozurdex', 'gilotrif', 'imfinzi', 'iressa',
       'laser photocoagulation', 'renflexis', 'a

In [0]:
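# Texts that appear verbatim in both train.csv and test.csv
common=np.intersect1d(train.text,test.text)
# Keep the overlapping test rows aside; they are labelled separately at submission time.
# .copy() avoids SettingWithCopyWarning when columns are added to these frames later.
test_common=test.query('text in @common').copy()
train=train.query('text not in @common').copy()
test=test.query('text not in @common').copy()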
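The cell above can be hard to read at a glance, so here is a minimal sketch of the same overlap handling on toy frames, using boolean masks with `isin` instead of `query`. The frame names (`train_toy`, `test_toy`) and their contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented toy data: the text "b" appears in both splits.
train_toy = pd.DataFrame({"text": ["a", "b", "c"], "sentiment": [2, 0, 1]})
test_toy = pd.DataFrame({"text": ["b", "d"]})

# Texts present in both frames.
common = np.intersect1d(train_toy["text"], test_toy["text"])

# Split the test rows into overlapping and non-overlapping parts.
in_common = test_toy["text"].isin(common)
test_common_toy = test_toy[in_common].copy()
test_rest = test_toy[~in_common].copy()

# Drop the overlapping texts from the training rows as well.
train_rest = train_toy[~train_toy["text"].isin(common)].copy()

print(list(common), len(train_rest), list(test_rest["text"]))
```

Note that this mirrors the notebook's choice to drop the shared texts from training; whether that helps or hurts the final score is not something the notebook measures.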


In [349]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [350]:
import string
from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus: nltk.download('stopwords')

def transform(df):
    """Add surface-level text features to df and lowercase its 'text' column."""
    stop = set(stopwords.words('english'))
    punctuation = set(string.punctuation)

    # --- Features computed on the original-case text ---
    # Exact token match against all_drugs. This runs before lowercasing, so
    # capitalised mentions such as "Stelara" do not match lowercase drug names.
    df['drug_count'] = df['text'].apply(lambda x: len(np.intersect1d(x.split(), all_drugs)))
    df['has_upper'] = df['text'].apply(lambda x: x.lower() != x).astype(int)
    df['upper'] = df['text'].apply(lambda x: len([w for w in x.split() if w.isupper()]))
    df['sentence_end'] = df['text'].apply(lambda x: x.endswith(".")).astype(int)
    df['after_comma'] = df['text'].apply(lambda x: x.startswith(",")).astype(int)
    # x[:1] instead of x[0] avoids an IndexError on empty strings
    df['sentence_start'] = df['text'].apply(lambda x: "A" <= x[:1] <= "Z").astype(int)

    # --- Features computed on the lowercased text ---
    df['text'] = df['text'].apply(lambda x: x.lower())
    df['word_count'] = df['text'].apply(lambda x: len(str(x).split(" ")))
    df['char_count'] = df['text'].str.len()

    def avg_word(sentence):
        # Mean token length; 0.0 for texts without any tokens
        words = sentence.split()
        return sum(len(word) for word in words) / len(words) if words else 0.0

    df['avg_word'] = df['text'].apply(avg_word)
    df['stopwords'] = df['text'].apply(lambda x: len([w for w in x.split() if w in stop]))
    df['numerics'] = df['text'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))
    df['word_density'] = df['char_count'] / (df['word_count'] + 1)
    df['punctuation_count'] = df['text'].apply(lambda x: sum(1 for ch in x if ch in punctuation))

    # Experiments left disabled:
#     df["phrase_count"] = df.groupby("drug")["text"].transform("count")
#     df['drug']=pd.factorize(df['drug'])[0]
#     df["text"]=df["text"].apply(lambda x: " ".join([a for a in x.split() if a not in all_drugs]))
    return df

train = transform(train)
test = transform(test)

# dense_features = ["phrase_count", "word_count", "has_upper", "after_comma", "sentence_start", "sentence_end",'char_count','avg_word','stopwords','numerics','word_density','punctuation_count','drug','upper']
dense_features = [ "word_count", "has_upper", "after_comma", "sentence_start", "sentence_end",'drug_count','char_count','avg_word','stopwords','numerics','word_density','punctuation_count','upper']
train.groupby("sentiment")[dense_features].mean()

|sentiment|word_count|has_upper|after_comma|sentence_start|sentence_end|drug_count|char_count|avg_word|stopwords|numerics|word_density|punctuation_count|upper|
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0|308.270665|0.998379|0.0|0.87034|0.465154|0.340357|1847.124797|4.89372|123.0859|3.735818|5.770187|54.28201|13.423015|
|1|251.029869|0.995221|0.001195|0.882915|0.451613|0.311828|1425.449223|4.681141|110.222222|2.487455|5.581805|40.305854|10.480287|
|2|296.737579|0.994665|0.001|0.899633|0.444815|0.353118|1783.860287|4.872842|116.764255|3.466155|5.747094|55.731577|13.31077|


In [0]:
from nltk.stem import PorterStemmer,SnowballStemmer
from nltk.stem import WordNetLemmatizer
import re
import nltk
from nltk.corpus import stopwords

# Build the stopword set once instead of on every call
STOPS = set(stopwords.words('english'))

def url_to_words(raw_text):
    """Strip URL fragments and non-alphanumeric characters, then drop stopwords."""
    raw_text=raw_text.strip()
    no_coms=re.sub(r'\.com','',raw_text)
    no_urls=re.sub('https?://www','',no_coms)
    no_urls1=re.sub('https?://','',no_urls)
    # The original wrapped no_urls1.decode("utf-8-sig") in a bare try/except.
    # Python 3 str has no .decode(), so that branch always fell through and
    # the text passed on unchanged; it is omitted here.
    letters_only = re.sub("[^a-zA-Z0-9]", " ",no_urls1)
    words = letters_only.split()
    meaningful_words = [w for w in words if w not in STOPS]
    # Stemming and lemmatization were tried and left disabled:
#     st = SnowballStemmer('english')
#     d=[st.stem(word) for word in meaningful_words]
#     lemmatizer=WordNetLemmatizer()
#     dd=[lemmatizer.lemmatize(word) for word in d]
    return " ".join(meaningful_words)



In [0]:
train['text']=train['text'].apply(url_to_words)
test['text']=test['text'].apply(url_to_words)

In [353]:
from sklearn.utils import shuffle
# train=shuffle(train)
train.shape

(4453, 17)

In [354]:
train.sentiment.value_counts()

2    2999
1     837
0     617
Name: sentiment, dtype: int64
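The counts above show a strong imbalance toward class 2, which is relevant because the classifier later in this notebook is created with `class_weight='balanced'`. As a sketch of what that setting does, scikit-learn's "balanced" heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. The snippet rebuilds a label vector from the printed counts only; it does not touch the real data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the value_counts() output above.
counts = {2: 2999, 1: 837, 0: 617}

# Synthetic label vector with the same class frequencies.
y = np.concatenate([np.full(n, label) for label, n in counts.items()])
classes = np.array([0, 1, 2])

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula written out by hand.
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(cls, round(float(w), 3))
```

The rarest class (0, positive) receives the largest weight and the neutral class the smallest, so errors on minority classes cost more during training.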

In [355]:

rem_col=['text','sentiment','unique_hash','drug']
col=[v for v in list(train.columns) if v not in rem_col] 
col

['drug_count',
 'has_upper',
 'upper',
 'sentence_end',
 'after_comma',
 'sentence_start',
 'word_count',
 'char_count',
 'avg_word',
 'stopwords',
 'numerics',
 'word_density',
 'punctuation_count']

In [0]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
v_name = TfidfVectorizer(ngram_range=(1,3),stop_words="english", analyzer='word',max_features=50000)
name_tr =v_name.fit_transform(train['text'])
name_ts =v_name.transform(test['text'])


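# Character n-grams of length 1 to 7. Note: stop_words only applies when
# analyzer='word', so it has no effect on this vectorizer.
vc_name = TfidfVectorizer(ngram_range=(1,7),stop_words="english", analyzer='char',max_features=50000)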
name_tcr =vc_name.fit_transform(train['text'])
name_tcs =vc_name.transform(test['text'])
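To illustrate the difference between the two vectorizers above, here is a minimal sketch on two invented sentences, with shorter n-gram ranges so the vocabularies stay readable. Word n-grams treat "works" and "working" as unrelated tokens, while character n-grams share pieces such as "wor" between them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two invented documents, not taken from the dataset.
docs = ["stelara works well", "remicade stopped working"]

# Word unigrams and bigrams.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
X_word = word_vec.fit_transform(docs)

# Character n-grams of length 1 to 3.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X_char = char_vec.fit_transform(docs)

print("word features:", X_word.shape)
print("char features:", X_char.shape)
print("shared char n-gram 'wor' present:", "wor" in char_vec.vocabulary_)
```

Character n-grams tend to be more tolerant of misspellings and inflections, which suits informal forum text, at the cost of a much larger feature space.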

In [0]:
from scipy.sparse import csr_matrix
from scipy import sparse
final_features = sparse.hstack((train[col],name_tr,name_tcr )).tocsr()
final_featurest = sparse.hstack((test[col],name_ts,name_tcs )).tocsr()
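The stacking step above places the dense hand-crafted columns next to the two TF-IDF matrices. A minimal sketch with tiny invented matrices shows the resulting layout; here the dense block is converted to sparse explicitly, which is an extra step added for clarity rather than something the notebook does.

```python
import numpy as np
from scipy import sparse

# Invented stand-ins: 2 samples, 2 dense features, 3 + 1 TF-IDF features.
dense = np.array([[3.0, 1.0], [5.0, 0.0]])
tfidf_word = sparse.csr_matrix(np.array([[0.0, 0.7, 0.0], [0.2, 0.0, 0.0]]))
tfidf_char = sparse.csr_matrix(np.array([[0.1], [0.0]]))

# Column-wise concatenation; rows must line up across all blocks.
combined = sparse.hstack(
    (sparse.csr_matrix(dense), tfidf_word, tfidf_char)
).tocsr()

print(combined.shape)
print(combined.toarray())
```

One caveat worth keeping in mind: the dense columns (for example character counts in the thousands) are on a very different scale from TF-IDF values, which can matter for linear models.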

In [0]:
from sklearn.model_selection import train_test_split
import math
from sklearn.metrics import accuracy_score,f1_score,mean_squared_error,mean_squared_log_error
X=final_features
y=train['sentiment']
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.25,random_state = 1994)

In [361]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
# m=MultinomialNB(alpha=0.00000000001)
# m.fit(X_train,y_train)
# p=m.predict(X_val)

# {2:1,1:4,0:7}
m=OneVsRestClassifier(LogisticRegression(class_weight='balanced',random_state=1994))
# m=OneVsRestClassifier(SGDClassifier(class_weight='balanced',loss='log'))
m.fit(X_train,y_train)
p=m.predict(X_val)

# m=RandomForestClassifier(class_weight='balanced',random_state=1994,max_depth=17,max_features=50000)
# m.fit(X_train,y_train)
# p=m.predict(X_val)

print(f1_score(y_val,p,average='macro'))



0.5329982957993784
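As a self-contained sketch of the modelling step above, the snippet below fits a one-vs-rest logistic regression with balanced class weights on synthetic, imbalanced three-class data and reports macro F1. The data comes from `make_classification`, and `stratify=y_toy` and `max_iter=1000` are additions made here; neither is used in the notebook cell.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem with class proportions loosely like the real data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3,
    weights=[0.12, 0.16, 0.72], random_state=0,
)

# Stratified hold-out split (an assumption of this sketch).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# One binary logistic regression per class, each with balanced class weights.
clf = OneVsRestClassifier(LogisticRegression(class_weight="balanced", max_iter=1000))
clf.fit(X_tr, y_tr)

score = f1_score(y_va, clf.predict(X_va), average="macro")
print("binary estimators:", len(clf.estimators_))
print("macro F1:", score)
```

The wrapper trains one binary classifier per class and predicts the class whose classifier is most confident, so three estimators are fitted here.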


In [362]:
m.fit(X,y)
pp=m.predict(final_featurest)



In [363]:
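test["sentiment"] = pp

# Test rows whose text also appears in train.csv were set aside earlier;
# they are all assigned label 2 (neutral) here.
test_common['sentiment']=2
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
sub=pd.concat([test[["unique_hash", "sentiment"]],
               test_common[["unique_hash", "sentiment"]]],ignore_index=True)

# sub=test[["unique_hash", "sentiment"]]
sub["sentiment"] = sub["sentiment"].astype(int)
sub.to_csv("subLocalOvO_LR7_colab.csv", index=False)
sub.sentiment.value_counts()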

2    2103
1     483
0     338
Name: sentiment, dtype: int64