# NLP Assignment Notebook

## Problem Statement
We are given two objective for this assignmetn 
1. Get the most frequent entities from the tweets
2. Find polarity/sentiment of each author towards entities

## Getting Data
The Data is provided in as tweets.json file consisting of nested dictionary. Keys of which are the tweet ID and inner dictionary contain author name and the tweet he made

Before Starting to work on the objective of the project we need to clean the tweets. Which is the first step

In [1]:
### This cell contains all the tools we will be using for this project
import pandas as pd
import numpy as np
import re
import string

import spacy

## Importing Data

In [2]:
### Reading the the json file containing tweets
df = pd.read_json('./tweets.json')

In [3]:
df.head()

Unnamed: 0,2013-07-18 09:39:46.071961602,2013-07-17 03:40:32.173842437,2013-07-15 15:41:16.553048065,2013-07-12 19:19:42.367813635,2013-07-04 12:40:34.334232586,2013-07-04 08:44:42.278539265,2013-07-04 04:22:03.305394179,2013-07-03 21:48:41.159868423,2013-07-03 15:55:15.081797632,2013-07-03 04:25:53.837944834,...,1987-06-22 19:36:28.372967425,1987-06-22 13:38:05.220745216,1987-06-20 01:14:46.517178368,1987-06-19 13:30:22.587748353,1987-06-19 13:03:48.404117505,1987-06-19 12:17:53.643945985,1987-06-19 12:06:26.675290112,1987-06-17 23:05:41.186953217,1987-06-17 15:18:00.525635584,1987-06-13 10:44:06.537678849
tweet_author,Hematopoiesis News,"Michael Wang, MD",1stOncology,Toby Eyre,Lymphoma Hub,David Ledger,N Wales Cancer Forum,European Pharmaceutical Review,Graham Collins,CLL Ireland,...,C A R E N,Werneth Cricket,"John P. Leonard, MD",Joy is a Lifestyle,Micheál 🇮🇪,Joy is a Lifestyle,𝓒𝓻𝓲𝔃𝔃𝔂 𝓟𝓮𝓻𝓻𝔂🌹,IQWiG,Medibooks,Medibooks
tweet_text,⚕️ Scientists conducted a Phase II study of ac...,This phase 2 Acalabrutinib-Venetoclax (AV) tri...,#NICE backs #AstraZenecas #Calquence for #CLL ...,#acalabrutinib is a valuable option in pts int...,NICE has recommended the use of acalabrutinib ...,NICE backs AstraZeneca’s Calquence for CLL htt...,This is England for now - these decisions usua...,"AstraZeneca’s Calquence (acalabrutinib), a che...",Superstar @tobyeyre82 responding to the excell...,CLL patients all know the drug Ibrutinib and y...,...,I miss them! 😋😆😅\n\n#FotoRus #FriendshipForeve...,"The fixtures are out, first team will travel t...",Partnering @GileadSciences &amp;Ono BTKi-combo...,Hanging out with Friends! :) #FF #CLL #Happine...,What I'd do to go to Gerrard's last game at An...,Hanging out with Friends! :) #FF #CLL #Happine...,Hanging out with Friends! :) #FF #CLL #Happine...,Zusatznutzen von #Idelalisib ist weder für #CL...,#Hematología PTK2 EXPRESSION AND IMMUNOCHEMOTH...,#Hematología MUTATIONS IN TLR/MYD88 PATHWAY ID...


The dataframe has tweet_author and tweet_text as rows. We could simply transpose the dataframe and reset the index from tweet_id to integers

In [4]:
# Transposing the data frame to get the required columns
df_trans = df.T
df_trans.head()

Unnamed: 0,tweet_author,tweet_text
2013-07-18 09:39:46.071961602,Hematopoiesis News,⚕️ Scientists conducted a Phase II study of ac...
2013-07-17 03:40:32.173842437,"Michael Wang, MD",This phase 2 Acalabrutinib-Venetoclax (AV) tri...
2013-07-15 15:41:16.553048065,1stOncology,#NICE backs #AstraZenecas #Calquence for #CLL ...
2013-07-12 19:19:42.367813635,Toby Eyre,#acalabrutinib is a valuable option in pts int...
2013-07-04 12:40:34.334232586,Lymphoma Hub,NICE has recommended the use of acalabrutinib ...


In [5]:
### Changing index from timestamp to numbers
idx = np.array(range(len(df_trans))) ### array of numbers equal to length of dataframe 
tweet_df = df_trans.set_index(idx)

In [6]:
### Final df
tweet_df.head()

Unnamed: 0,tweet_author,tweet_text
0,Hematopoiesis News,⚕️ Scientists conducted a Phase II study of ac...
1,"Michael Wang, MD",This phase 2 Acalabrutinib-Venetoclax (AV) tri...
2,1stOncology,#NICE backs #AstraZenecas #Calquence for #CLL ...
3,Toby Eyre,#acalabrutinib is a valuable option in pts int...
4,Lymphoma Hub,NICE has recommended the use of acalabrutinib ...


In [12]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43347 entries, 0 to 43346
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tweet_author  43347 non-null  object
 1   tweet_text    43347 non-null  object
dtypes: object(2)
memory usage: 1015.9+ KB


## Exploring and Cleaning Tweets

In [7]:
### Counting total number of unique authors
tweet_df['tweet_author'].nunique()

9292

In [8]:
### Counting total number of unique authors
tweet_df['tweet_text'].nunique()

41776

We can see here that the unique number of tweets are less than the total tweets present in the data. Let's make sure there is no repeated instance in our data frame to get more accurate measure of frequency and polarity.

In [9]:
### Counting duplicate rows in the dataframe
tweet_df.duplicated().value_counts()

False    41818
True      1529
dtype: int64

We can see that there are around 1529 duplicate rows in the dataframe. We are going to drop these rows out of the dataframe to avoid redundancy of instances for our main objectives

In [12]:
### Removing duplicate rows
tweet_df.drop_duplicates(inplace=True)

In [13]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41818 entries, 0 to 43346
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tweet_author  41818 non-null  object
 1   tweet_text    41818 non-null  object
dtypes: object(2)
memory usage: 980.1+ KB


In [16]:
### We need to reset index
tweet_df.reset_index(drop = True,inplace = True)

In [17]:
### Cross checking 
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41818 entries, 0 to 41817
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tweet_author  41818 non-null  object
 1   tweet_text    41818 non-null  object
dtypes: object(2)
memory usage: 653.5+ KB


**Note:-** At this point I also tried to remove tweets which were not in english because I thought every language has different grammer rules and it would affect the accuracy of entity extraction which is based on grammer rules.
However, I was unable to do so because I was unable to resolve the errors I received.

I tried to use textblob and spacy to do so.

### Tweet Cleaning

Now we need to clean our tweets to be able to process it 

We will be performing basic cleaning process by removing
 * Hashtags and mention
   - We will be removing hastags and mentions because they contains names of people and abbrevation and often redundant with main text
 * Punctuation
 * Newline 
 * Digits
 * emojis and symbols
 * Links and url
 
I am mostly using regular expression for this step available I found at google

**Note** - I was confused about removing hashtags whether to remove them or keep them as they could act as entities but I finally decided to drop them as they can be most of the time redundant with aim of main text and can contain lots of abbreviation.

In [19]:
### Defining emoji pattern to be removed
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)# symbols & pictographs
        "]+", flags=re.UNICODE)

def clean_tweets(txt):
    ### Removing endline
    txt = txt.replace('\n','')
    ### Removing digits from tweets
    txt = re.sub(r"\d", "", txt)
    ### Removing mentions
    txt = re.sub("@[A-Za-z0-9_]+","", txt)
    ### Removing hashtags
    txt = re.sub("#[A-Za-z0-9_]+","", txt)
    ### removing puctuations from the tweet
    no_punc = [char for char in txt if char not in string.punctuation]
    txt = ''.join(no_punc) 
    ### Removing emojis
    txt = emoji_pattern.sub(r'',txt)    
    ### Removing links
    txt = re.sub(r"http\S+", '',txt, flags=re.MULTILINE)
    return txt

In [22]:
### Applying the clean tweet function to tweet_text column
tweet_df['Clean_text'] = tweet_df['tweet_text'].apply(clean_tweets)

In [23]:
tweet_df['Clean_text'][1]

'This phase  AcalabrutinibVenetoclax AV trial that is still in recruitment phase will study how well venetoclax and acalabrutinib works in MCL patients who either relapsed or nonrespondent to the initial therapy'

In [24]:
### Dropping Original tweet_text from the dataframe
tweet_df.drop('tweet_text',axis = 1,inplace = True)

In [25]:
tweet_df.head(10)

Unnamed: 0,tweet_author,Clean_text
0,Hematopoiesis News,⚕️ Scientists conducted a Phase II study of ac...
1,"Michael Wang, MD",This phase AcalabrutinibVenetoclax AV trial t...
2,1stOncology,backs for
3,Toby Eyre,is a valuable option in pts intolerant to Fu...
4,Lymphoma Hub,NICE has recommended the use of acalabrutinib ...
5,David Ledger,NICE backs AstraZeneca’s Calquence for CLL
6,N Wales Cancer Forum,This is England for now these decisions usual...
7,European Pharmaceutical Review,AstraZeneca’s Calquence acalabrutinib a chemot...
8,Graham Collins,Superstar responding to the excellent news of...
9,CLL Ireland,CLL patients all know the drug Ibrutinib and y...


In [26]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41818 entries, 0 to 41817
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tweet_author  41818 non-null  object
 1   Clean_text    41818 non-null  object
dtypes: object(2)
memory usage: 653.5+ KB


Everything looks good till now

Since there not much options to explore our data or visualize it as there are just two columns we can move on towards extracting entities from our Tweets

## Extracting Entities using Spacy
According to the examples given in the problem statement, what I figured out was that we have to consider Noun phrase as entities. I tried Named entity feature of spacy to extract entities but it most of the time missed entities which do not fall into it's predefined catagory. So after various experiment this is the following process I came up to extract entities with maximum efficiency.

In [28]:
### Using noun_chunk feature to extract all Noun phrase from the tweets
NLP = spacy.load("en_core_web_sm")

def Noun_phrase_extraction(tweet):  
    ### Converting tweet to an spacy object
    doc = NLP(tweet)
    np_list = []
    for np in doc.noun_chunks:
        ### Getting noun chunks from the text and appending it to the list of Noun phrases from the text
        np_list.append(np.text)

    return np_list


In [29]:
### Applying the Noun_phrase_extraction function to clean text column and storing it as a new column
tweet_df['Noun_phrase'] = tweet_df['Clean_text'].apply(Noun_phrase_extraction)

In [30]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41818 entries, 0 to 41817
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tweet_author  41818 non-null  object
 1   Clean_text    41818 non-null  object
 2   Noun_phrase   41818 non-null  object
dtypes: object(3)
memory usage: 980.2+ KB


In [32]:
tweet_df['Noun_phrase'][0]

['⚕️ Scientists',
 'a Phase II study',
 'acalabrutinib',
 'patients',
 'relapsedrefractory',
 'who',
 'an overall response rate']

In [33]:
tweet_df['Noun_phrase'][1]

['This phase',
 'AcalabrutinibVenetoclax AV trial',
 'recruitment phase',
 'how well venetoclax and acalabrutinib works',
 'MCL patients',
 'who',
 'the initial therapy']

We can see here that there our various noun phrase that we cannot consider as entities like "who","an overall response rate","the initial therapy" and so on. To remove words like "Who","a","this" etc we will remove the stop words out of the noun phrase. 

We did not remove the stop words earlier because it can affect the meaning and grammer of the sentance.

And to remove sentances like initial therapy we need to remove the entities which contain words having dependancy as adjective modifier.I came accross which dependancy to remove by using spacy's part of speech and dependancy tagging feature. There is a possibility there are other dependancy that needs to be removed to improve the efficiancy of entities which I didn't figure out.

In [36]:
raw_text = 'the initial therapy'
text1= NLP(raw_text)

for word in text1:
    print(word.text,  word.pos_,word.dep_)


the DET det
initial ADJ amod
therapy NOUN ROOT


In [37]:
### Function to first remove stopwords out of Noun phrase and then check if it contains amod remove it entirely
def parse_entity(np):
    ent_list = []
    for phrase in np:
        doc_p = NLP(phrase)
        s = ''
        for token in doc_p:
            if not token.is_stop:
                s = ' '.join((s,token.text))
        doc_et = NLP(s)  
        for token in doc_et:
            if token.dep_ == 'amod':
                s = ''
                break
        if s == '':
            break
        ent_list.append(s)
    return ent_list

In [38]:
### Checking it on one list
x = parse_entity(tweet_df['Noun_phrase'][0])

In [39]:
x

[' ⚕ ️ Scientists',
 ' Phase II study',
 ' acalabrutinib',
 ' patients',
 ' relapsedrefractory']

In [40]:
### Applying it to entire dataframe/Noun phrase column
tweet_df['Entity_list'] = tweet_df['Noun_phrase'].apply(parse_entity)

In [41]:
tweet_df.head()

Unnamed: 0,tweet_author,Clean_text,Noun_phrase,Entity_list
0,Hematopoiesis News,⚕️ Scientists conducted a Phase II study of ac...,"[⚕️ Scientists, a Phase II study, acalabrutini...","[ ⚕ ️ Scientists, Phase II study, acalabruti..."
1,"Michael Wang, MD",This phase AcalabrutinibVenetoclax AV trial t...,"[This phase, AcalabrutinibVenetoclax AV trial,...","[ phase, AcalabrutinibVenetoclax AV trial, r..."
2,1stOncology,backs for,[ backs],[ backs]
3,Toby Eyre,is a valuable option in pts intolerant to Fu...,"[a valuable option, Further valuable data, Ear...",[]
4,Lymphoma Hub,NICE has recommended the use of acalabrutinib ...,"[NICE, the use, acalabrutinib, patients, treat...","[ NICE, use, acalabrutinib, patients]"


In [42]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41818 entries, 0 to 41817
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tweet_author  41818 non-null  object
 1   Clean_text    41818 non-null  object
 2   Noun_phrase   41818 non-null  object
 3   Entity_list   41818 non-null  object
dtypes: object(4)
memory usage: 1.3+ MB


In [44]:
tweet_df['Entity_list'][1]

[' phase',
 ' AcalabrutinibVenetoclax AV trial',
 ' recruitment phase',
 ' venetoclax acalabrutinib works',
 ' MCL patients']

**Note** There still are element in the list which we can remove like `phase`,`patients` given more resource or time to figure out the pattern

## Objective 1
Create a csv file which contains entities used by the authors in various tweets with their frequency.

For doing this we will create a dictionary in which we will store all the entities as keys and their frequency as value. And then convert it into a dataframe which we can later store as csv file

In [45]:
### Creating empty dictionary
dictionary = {}

In [46]:
### Creating function to store entity into a dictionary with frequency
def count_entities(ent_list):
    for ent in ent_list:
        if ent in dictionary.keys():
            dictionary[ent] += 1
        else:
            dictionary[ent] = 1
    return dictionary

In [49]:
### Applying the count_entities function to Entity_list
tweet_df['Entity_list'].apply(count_entities);

In [None]:
dictionary

In [148]:
list1 = list(dictionary)

In [50]:
Obj1_df = pd.DataFrame([(i, j) for i, j in dictionary.items()], 
                   columns=['entity','frequency'])


In [None]:
### We can sort our dataframe by frequency by using the following line of code

# Obj1_df.sort_values(by ='frequency',ascending = True,ignore_index = True)

In [51]:
Obj1_df.head(10)

Unnamed: 0,entity,frequency
0,⚕ ️ Scientists,2
1,Phase II study,2
2,acalabrutinib,536
3,patients,3766
4,relapsedrefractory,70
5,phase,312
6,AcalabrutinibVenetoclax AV trial,2
7,recruitment phase,2
8,venetoclax acalabrutinib works,2
9,MCL patients,8


In [54]:
Obj1_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30054 entries, 0 to 30053
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   entity     30054 non-null  object
 1   frequency  30054 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 469.7+ KB


**We have 30054 entities in total**

Now we have our dataframe as required and we can now store it as a csv file

In [53]:
Obj1_df.to_csv('objective1.csv',index=False)

## Objective 2
Finding the polarity of each author towards an entity and then storing it in a csv

For this we will again create a dictionary but a nested dictionary with keys equal to entity and inner dictionary having keys equal to author and value equal to frequency of entity used by that author

In [55]:
### Creating a empty dictionary
d2 = {}

### Creating a function to rows passed as parameter
def auth_ent_freq(value):
    for ent in value.Entity_list:
        if ent in d2.keys():
            if value.tweet_author in d2[ent].keys():
                d2[ent][value.tweet_author] += 1
            else: 
                d2[ent][value.tweet_author] = 1
        else:
            d2[ent] = {value.tweet_author:1}
    return d2

In [56]:
tweet_df[['tweet_author','Entity_list']].apply(auth_ent_freq,axis = 1);

In [None]:
d2

Now we can store our dictionary as dataframe but first for that we need to conver it in form of an array a list of list which contains entity author and frequency so that it can be stored in one row.

list = [entity,author,frequency]

In [58]:
ar = []
for e,v in d2.items():
    for a,f in v.items():
        ls = [e,a,f]
        ar.append(ls)

In [59]:
obj2_df = pd.DataFrame(ar,columns = ["entity",'author','Frequency'])

In [60]:
obj2_df.head(10)

Unnamed: 0,entity,author,Frequency
0,⚕ ️ Scientists,Hematopoiesis News,1
1,Phase II study,Hematopoiesis News,1
2,acalabrutinib,Hematopoiesis News,1
3,acalabrutinib,Lymphoma Hub,10
4,acalabrutinib,Helen Oram,1
5,acalabrutinib,Paperbirds_Hematology,4
6,acalabrutinib,Cardio-Targets,2
7,acalabrutinib,Lymphoma Papers,3
8,acalabrutinib,"CancerNetwork®, Home of the Journal ONCOLOGY®",3
9,acalabrutinib,Medivizor,12


we have a dataframe where each attribute is stored with it's author and the frequency used by the author in one row

In [61]:
obj2_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64541 entries, 0 to 64540
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   entity     64541 non-null  object
 1   author     64541 non-null  object
 2   Frequency  64541 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.5+ MB


In [63]:
### Let's save a copy of dataframe before making any changes to original so that we can revert back if needed
copy_df = obj2_df.copy()

### Polarity analysis based on frequency

The main process to detect polarity begins now

For detecting polarity of an author towards an entity we will compare each frequency with the mean of author's frequency. If an author uses an entity more than the mean value of him/her using other entity we would consider him/her biased towards that entity.

In [65]:
### Using lambda function to comapare each frequency to the mean of the group
obj2_df['p1'] = obj2_df.groupby('author')['Frequency'].apply(lambda x: x >= x.mean())

In [66]:
obj2_df.head()

Unnamed: 0,entity,author,Frequency,p1
0,⚕ ️ Scientists,Hematopoiesis News,1,False
1,Phase II study,Hematopoiesis News,1,False
2,acalabrutinib,Hematopoiesis News,1,False
3,acalabrutinib,Lymphoma Hub,10,True
4,acalabrutinib,Helen Oram,1,False


**The values stored in p1 column are in numpy boolean data type which cannot be evaluated using if else statement to store value as positive or negative so we need to convert them into python boolean**

In [68]:
obj2_df['p2'] = obj2_df['p1'].astype('bool')

In [69]:
obj2_df.head()

Unnamed: 0,entity,author,Frequency,p1,p2
0,⚕ ️ Scientists,Hematopoiesis News,1,False,False
1,Phase II study,Hematopoiesis News,1,False,False
2,acalabrutinib,Hematopoiesis News,1,False,False
3,acalabrutinib,Lymphoma Hub,10,True,True
4,acalabrutinib,Helen Oram,1,False,False


Now finally we can apply if else statement to convert true to positive and false to negative using thr function polarity_val

In [70]:
def polarity_val(val):
    if (val == True):
        return "Positive"
    else:
        return "Negative"

In [73]:
obj2_df['overall_polarity'] = obj2_df['p2'].apply(polarity_val)

In [76]:
obj2_df.head()

Unnamed: 0,entity,author,Frequency,p1,p2,overall_polarity
0,⚕ ️ Scientists,Hematopoiesis News,1,False,False,Negative
1,Phase II study,Hematopoiesis News,1,False,False,Negative
2,acalabrutinib,Hematopoiesis News,1,False,False,Negative
3,acalabrutinib,Lymphoma Hub,10,True,True,Positive
4,acalabrutinib,Helen Oram,1,False,False,Negative
...,...,...,...,...,...,...
64536,follikuläres Lymphom belegt,IQWiG,1,True,True,Positive
64537,PTK EXPRESSION,Medibooks,1,False,False,Negative
64538,IMMUNOCHEMOTHERAPY,Medibooks,1,False,False,Negative
64539,OUTCOME,Medibooks,1,False,False,Negative


now let's drop off the p1 and p2 and Frequency columns from the dataframe

In [80]:
obj2_df.drop('Frequency',axis = 1,inplace = True)

In [77]:
obj2_df.drop(['p1','p2','Frequency'],axis=1,inplace = True)

In [81]:
### check the head
obj2_df.head()

Unnamed: 0,entity,author,overall_polarity
0,⚕ ️ Scientists,Hematopoiesis News,Negative
1,Phase II study,Hematopoiesis News,Negative
2,acalabrutinib,Hematopoiesis News,Negative
3,acalabrutinib,Lymphoma Hub,Positive
4,acalabrutinib,Helen Oram,Negative


In [82]:
### Now let's save it to a csv
obj2_df.to_csv('objective2.csv',index = False)

I think there are resources and algorithms available to do this whole process of objective 2 in a more straight forward way with lesses number of steps.I was unable to figure out that process so used this basic approach to reach my goal. I went through lots of resources regarding sentiment analysis to figure out a way to do this task and it was difficult to wrap my head around the concept. 

Learning to use spacy and trying to figure out pattern to detect enities took a lot of time leaving little time to work on objective 2