**<h2>Emotions Prediction based on content</h2>**
Using the dataset, predict the emotion (happy, sad, or angry) expressed in each piece of content. Analyze and clean the data using NLP techniques, then apply a machine learning algorithm for prediction.
| Column | Description |
|--------|--------------|
|  content   |     Textual data  | 
|  emotion  |    happy, sad or angry   |  


In [1]:
import pandas as pd
import re
import spacy
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

### Question 1: Data loading and observation

- Load the dataset named `train_emotion.csv` and store it in the variable `df`. 
- If any missing values are present, drop the rows that contain them.
- Calculate the word coverage of the first sentence in the dataset, rounded to two decimal places, and store it in the variable `coverage`. (Hint: word coverage = number of words in the sentence / number of unique words in the sentence.)
- Assign a copy of the dataframe to the variable `df_q1`.
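
As a small illustration of the coverage formula in the hint, the sketch below computes it for a made-up sentence. The helper name `word_coverage` and the sample sentence are assumptions for illustration only, and the whitespace split is a simplification that may differ from the tokenization the grader expects.

```python
def word_coverage(sentence: str) -> float:
    """Number of words divided by number of unique words, rounded to 2 places."""
    words = sentence.split()  # naive whitespace tokenization (an assumption)
    return round(len(words) / len(set(words)), 2)


# Hypothetical sample: 6 words, 4 of them unique.
sample = "the cat saw the other cat"
sample_coverage = word_coverage(sample)
print(sample_coverage)
```

A sentence with no repeated words gives a coverage of exactly 1.0, and the value grows as words repeat.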

In [2]:
df = pd.read_csv('train_emotion.csv')

In [3]:
df = df.dropna()  # dropna() returns a new DataFrame, so assign it back
df

Unnamed: 0,content,emotion
0,I am not your toy.,angry
1,ugh. your so fake i bet if you look at the bot...,angry
2,"I smile to the camera, I smile to my friends, ...",sad
3,"If someone breaks your heart, just punch them ...",angry
4,Being grateful to those that harvest our garde...,happy
...,...,...
1830,"['You Hurt Me But I Still Love You.', 'True Lo...",sad
1831,"This man, is man, a man, good man, way man, to...",angry
1832,"Go Ahead, Judge Me. Just Remember To Be Perfec...",angry
1833,"I’m take a nice long shit, so don’t stress me !",angry


In [4]:

first_words = df['content'].iloc[0].split()  # words of the first sentence, not its first character
coverage = round(len(first_words) / len(set(first_words)), 2)


In [5]:
print(coverage)

1.0


In [6]:
df_q1 = df.copy()

In [7]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q1 = [df_q1, coverage]
    from emotions_predict import emotions
    emotions.save_ans1(q1)
except:
    print("Please check if the variable to the question has been assigned properly")

Please check if the variable to the question has been assigned properly


### Question 2: Data preprocessing

- Create a new column `no_punct` that holds the `content` column converted to lower case with punctuation removed.
- Load the spaCy model and store it in the variable `nlp`.
- Complete the function `tokens`, which takes text as input and returns a list of tokenized words using the spaCy model.
- Apply the function on the `no_punct` column and store the result in a new column called `tokenized`.
- Complete the function `remove_stopwords`, which takes text as input and returns the list of words without stop words. (Hint: use a stop-word library.)
- Apply the function on the `tokenized` column and store the result in a new column called `rem_stop`.
- Assign a copy of the dataframe to the variable `df_q2`.
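
The cleaning steps above can be sketched on a single sentence using only the standard library. This is a simplified stand-in, not the notebook's pipeline: it splits on whitespace instead of using spaCy, and `DEMO_STOPWORDS`, `PUNCT_TABLE`, and `clean` are hypothetical names introduced for this example.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# a real run would use a full list such as NLTK's English stop words.
DEMO_STOPWORDS = {"i", "am", "not", "your", "the", "a", "to"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Lower-case, strip ASCII punctuation, split, and drop stop words."""
    lowered = text.lower().translate(PUNCT_TABLE)
    tokens = lowered.split()  # whitespace split stands in for a real tokenizer
    return [t for t in tokens if t not in DEMO_STOPWORDS]


cleaned = clean("I am not your toy.")
print(cleaned)
```

Note that `string.punctuation` covers ASCII characters only, so typographic marks such as the curly apostrophe `’` pass through unchanged.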

In [8]:
import string
def no_punct(text):
    # keep every character that is not ASCII punctuation
    return ''.join([ch for ch in text if ch not in string.punctuation])
# lower-case into the new column only, leaving the original 'content' untouched
df['no_punct']=df['content'].str.lower().apply(no_punct)

In [9]:
nlp=spacy.load('en_core_web_sm')
def tokens(text):
    doc=nlp(text)
    lister=[]
    for word in doc:
        lister.append(word.text)
    return lister
df['tokenized']=df['no_punct'].apply(tokens)
df['tokenized']


0                                 [i, am, not, your, toy]
1       [ugh, your, so, fake, i, bet, if, you, look, a...
2       [i, smile, to, the, camera, i, smile, to, my, ...
3       [if, someone, breaks, your, heart, just, punch...
4       [being, grateful, to, those, that, harvest, ou...
                              ...                        
1830    [you, hurt, me, but, i, still, love, you, true...
1831    [this, man, is, man, a, man, good, man, way, m...
1832    [go, ahead, judge, me, just, remember, to, be,...
1833    [i, ’m, take, a, nice, long, shit, so, do, n’t...
1834    [i, knw, u, hv, blocked, me, but, still, i, lu...
Name: tokenized, Length: 1835, dtype: object

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
def remove_stopwords(text):
    return [word for word in text if word not in stop_words]
df['rem_stop']=df['tokenized'].apply(remove_stopwords)

In [None]:
df_q2 = df.copy()

In [None]:
# alternative: remove stop words using spaCy's built-in is_stop flag
def remove_stop(text):
    doc=nlp(' '.join(text))
    final=[word.text for word in doc if not word.is_stop]  # keep strings, not Token objects
    return final
df['rem_stop_spacy']=df['tokenized'].apply(remove_stop)

In [None]:
df['rem_stop_spacy']

In [None]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q2 = [df_q2,nlp]
    from emotions_predict import emotions
    emotions.save_ans2(q2)
except:
    print("Please check if the variable to the question has been assigned properly")

### Question 3: Stemming and lemmatization

- Use the Snowball stemmer for stemming, with the language set to English.
- Complete the function `snow_stem`, which takes text as input and returns a list containing the stemmed version of every word.
- Apply the function on the `rem_stop` column and store the result in a new column called `Stem_Text`.
- Complete the function `lemma`, which takes text as input and returns the lemmatized words joined into a single string, using the WordNet lemmatizer.
- Apply the function on the `rem_stop` column and store the result in a column called `lemmatized`.
- Assign a copy of the dataframe to the variable `df_q3`.

In [None]:
def snow_stem(text):
    snow=SnowballStemmer(language='english')
    stemmed=[snow.stem(word) for word in text]
    return stemmed
df['Stem_Text']=df['rem_stop'].apply(snow_stem)

In [None]:
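from nltk.stem import WordNetLemmatizer
lem=WordNetLemmatizer()  # create once and reuse for every row
def lemma(text):
    lemas=[lem.lemmatize(word) for word in text]
    return ' '.join(lemas)  # joined string, as the task asks and CountVectorizer expects
df['lemmatized']=df['rem_stop'].apply(lemma)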

In [None]:
df_q3 = df.copy()

In [None]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q3 = df_q3
    from emotions_predict import emotions
    emotions.save_ans3(q3)
except:
    print("Please check if the variable to the question has been assigned properly")

### Question 4: Part of Speech tagging

- From the `df` dataframe, create a new dataframe that contains only the first 10 rows of the `content` column and save it in the variable `new_df`.
- Complete the function `pos_tag`, which takes text as input and returns a list of POS tags. (Hint: use the spaCy model.)
- Apply the function on the `content` column and store the result in the new column `pos_tag`.
- Assign a copy of the new dataframe (`new_df`) to the variable `df_q4`.
- <b>Example</b>

| content | pos_tag |
|--------|--------------|
| Whatever makes you feel bad, leave it. Whateve...   |     [PRON, VERB, PRON, VERB, ADJ, PUNCT, VERB, PRO...  | 
|  ['Hating Me Won’T Make You Pretty.', 'Anger Is...   |    [X, PUNCT, VERB, PRON, PROPN, VERB, PRON, PROP...   |  

In [None]:
new_df = df[['content']].head(10).copy()  # first 10 rows of 'content' only; copy() avoids SettingWithCopyWarning

In [None]:
def pos_tag(text):
    doc=nlp(text)
    # coarse-grained spaCy POS tag (e.g. PRON, VERB) for every token
    return [token.pos_ for token in doc]
new_df['pos_tag']=new_df['content'].apply(pos_tag)

In [None]:
df_q4 = new_df.copy()

In [None]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q4 = df_q4
    from emotions_predict import emotions
    emotions.save_ans4(q4)
except:
    print("Please check if the variable to the question has been assigned properly")

### Question 5: Named Entity Recognition

- Complete the function `ents`, which takes text as input and returns a list of (entity text, label) pairs.
- Apply the function to the `content` column of the new dataframe (`new_df`) and store the result in the new column `named_entity`.
- Assign a copy of the new dataframe to the variable `df_q5`.
- <b>Example</b>

| content | pos_tag | named_entity |
|--------|--------------|------------|
| LoVe ThE oNe WhO LoVeS YoU..... nOt ThE oNe Wh...   |     [NOUN, DET, NUM, PRON, VERB, PRON, PUNCT, PART...  | [(LoVe, ORG), (oNe, CARDINAL), (WhOm, PERSON)]  |
|  Whatever makes you feel bad, leave it. Whateve...   |    [PRON, VERB, PRON, VERB, ADJ, PUNCT, VERB, PRO...   |  [] |

In [None]:
def ents(text):
    doc=nlp(text)
    lister=[]
    for word in doc.ents:
        lister.append((word.text,word.label_))
    return lister
new_df['named_entity']=new_df['content'].apply(ents)

In [None]:
new_df['named_entity']

In [None]:
df_q5 = new_df.copy()

In [None]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q5 = df_q5
    from emotions_predict import emotions
    emotions.save_ans5(q5)
except:
    print("Please check if the variable to the question has been assigned properly")

### Question 6: Polarity score

- Using the `SentimentIntensityAnalyzer` class, find the polarity scores of the `content` column in the new dataframe (`new_df`).
- Store the result into the new column `polarity_score`.
- Assign a copy of the new dataframe(new_df) to the variable `df_q6`.
- <b>Example</b>

| content | pos_tag | named_entity | polarity_score |
|--------|--------------|------------|---------------|
|  Whatever makes you feel bad, leave it. Whateve...   |    [PRON, VERB, PRON, VERB, ADJ, PUNCT, VERB, PRO...   |  [] | {'neg': 0.273, 'neu': 0.581, 'pos': 0.145, 'compound': -0.296} |

In [None]:
sia=SentimentIntensityAnalyzer()
new_df['polarity_score']= new_df['content'].apply(sia.polarity_scores)
df_q6=new_df.copy()

In [None]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q6 = df_q6
    from emotions_predict import emotions
    emotions.save_ans6(q6)
except:
    print("Please check if the variable to the question has been assigned properly")

### Question 7: Feature extraction & model selection

- Store the `lemmatized` column of the `df` dataframe in the variable `X` and the `emotion` column in the variable `y`.
- Create an instance of `CountVectorizer` and store it in the variable `count_vect`.
- Fit and transform the extracted data (`X`) and store the result in `count_fit`.
- Convert it into a dataframe whose columns are the feature names and store it in the variable `words`.
- Split the dataset into training and testing sets named `X_train`, `X_test`, `y_train`, `y_test`, using the newly transformed data and the emotions, with a test size of 20% and the random state set to 42.
- Fit the model using Linear SVC and assign the fine-tuned model to the variable `model`. Use it to predict on `X_test`.
- Calculate the accuracy score rounded to two decimal places and store it in the variable `acc_score`. Get the classification report of the model and store it in the variable `class_report`.
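
To make the accuracy step concrete, here is a minimal hand-rolled sketch of what an accuracy score measures: the fraction of predictions that match the true labels. The function `accuracy` and the demo labels are made up for illustration; the notebook itself relies on scikit-learn's `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return round(correct / len(y_true), 2)


# Hypothetical labels: 4 of the 6 predictions are correct.
demo_true = ["happy", "sad", "angry", "sad", "happy", "angry"]
demo_pred = ["happy", "sad", "sad", "sad", "angry", "angry"]
demo_acc = accuracy(demo_true, demo_pred)
print(demo_acc)
```

Accuracy alone can hide weak performance on a rare class, which is why the task also asks for the per-class classification report.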

In [None]:
X = df['lemmatized']
y = df['emotion']

In [None]:
count_vect = CountVectorizer()
count_fit = count_vect.fit_transform(X)

words = pd.DataFrame(count_fit.toarray(),columns=count_vect.get_feature_names_out())
X_train,X_test,y_train,y_test=train_test_split(count_fit,y,test_size=0.2,random_state=42)

In [None]:
model = LinearSVC()
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,accuracy_score
acc_score = round(accuracy_score(y_test,y_pred),2)

class_report = classification_report(y_test,y_pred)

In [None]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q7 = [words,model,acc_score,class_report]
    from emotions_predict import emotions
    emotions.save_ans7(q7)
except:
    print("Please check if the variable to the question has been assigned properly")

### Question 8: Prediction

- Load the dataset named `validation_emotion.csv` and perform all the data pre-processing tasks in the same manner as the training data.  
- Predict the emotions for the validation dataset and store the result in the variable `y_pred_val` as a list.
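
Processing the validation data "in the same manner" includes reusing the vocabulary learned from the training data, so that every feature column means the same thing to the model. The sketch below illustrates the idea with a simplified bag-of-words built from the standard library. `fit_vocabulary` and `transform` are hypothetical helpers that mimic the fit/transform split; they are not scikit-learn's implementation.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each distinct training word to a column index (sorted for stability)."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return {word: i for i, word in enumerate(vocab)}


def transform(docs, vocab):
    """Count only words known from training; unseen words are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocab])
    return rows


train_docs = ["happy day", "sad day"]   # hypothetical training texts
val_docs = ["happy happy night"]        # "night" never appeared in training

vocab = fit_vocabulary(train_docs)      # learned from training data only
val_rows = transform(val_docs, vocab)   # validation reuses that vocabulary
print(vocab)
print(val_rows)
```

Fitting a fresh vocabulary on the validation texts would produce different columns, and typically a different number of them, than the ones the model was trained on.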

In [None]:
data_test=pd.read_csv('validation_emotion.csv')
data_test['content']=data_test['content'].str.lower()
data_test['no_punct']=data_test['content'].apply(no_punct)
data_test['tokenized']=data_test['no_punct'].apply(tokens)
data_test['rem_stop']=data_test['tokenized'].apply(remove_stopwords)
data_test['Stem_Text']=data_test['rem_stop'].apply(snow_stem)
data_test['lemmatized']=data_test['rem_stop'].apply(lemma)
data_test['pos_tag']=data_test['content'].apply(pos_tag)
data_test['named_entity']=data_test['content'].apply(ents)
X_val=data_test['lemmatized']
# reuse the vectorizer fitted on the training data so the feature columns match the model
count_fit_test = count_vect.transform(X_val)


In [None]:
y_pred_val = list(model.predict(count_fit_test))

In [None]:
################ Don't edit or delete the cell, Run this cell to test your code  ################
try:
    q8 = y_pred_val
    from emotions_predict import emotions
    emotions.save_ans8(q8)
except:
    print("Please check if the variable to the question has been assigned properly")

**Note : After you have finished solving the questions, please run the below cell to save your answers for testing.**

In [None]:
from emotions_predict import emotions
try:
    q1 = [df_q1, coverage]
    q2 = [df_q2,nlp]
    q3 = df_q3
    q4 = df_q4
    q5 = df_q5
    q6 = df_q6
    q7 = [words,model,acc_score,class_report]
    q8 = y_pred_val
    emotions.save_answer(q1,q2,q3,q4,q5,q6,q7,q8)
 
except:
    print("Assign the answers to all the variables properly.")
    emotions.remove_pickle()
    try:
        q1 = [df_q1, coverage]
        emotions.save_ans1(q1)
    except:
        pass
    try:
        q2 = [df_q2,nlp]
        emotions.save_ans2(q2)
    except:
        pass
    try:
        q3 = df_q3
        emotions.save_ans3(q3)
    except:
        pass
    try:
        q4 = df_q4
        emotions.save_ans4(q4)
    except:
        pass
    try:
        q5 = df_q5
        emotions.save_ans5(q5)
    except:
        pass
    try:
        q6 = df_q6
        emotions.save_ans6(q6)
    except:
        pass
    try:
        q7 = [words,model,acc_score,class_report]
        emotions.save_ans7(q7)
    except:
        pass
    try:
        q8 = y_pred_val
        emotions.save_ans8(q8)
    except:
        pass
    
    